That's the same kind of error I keep seeing on my two clusters,
regenerated some months ago. It seems like a pseudo-split-brain that
should be impossible on a replica 3 cluster, but it keeps happening.
Sadly I'm going to ditch Gluster ASAP.
Diego
On 18/01/2024 07:11, Hu Bert wrote:
> Good morning,
> the heal is still not running. Pending heals now sum up to 60K per brick.
> The heal started instantly (e.g. after a server reboot) with version
> 10.4, but doesn't with version 11. What could be wrong?
>
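> A first sanity check (just the standard CLI, as a sketch - "workdata"
> being the affected volume): whether glustershd is up on each server, and
> what the condensed per-brick pending counts look like:
>
> gluster volume heal workdata info summary
> ps aux | grep glustershd
>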
> I only see these errors on one of the "good" servers in glustershd.log:
>
> [2024-01-18 06:08:57.328480 +0000] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
> remote operation failed.
> [{path=<gfid:cb39a1e4-2a4c-4727-861d-3ed9ef00681b>},
> {gfid=cb39a1e4-2a4c-4727-861d-3ed9ef00681b}, {errno=2},
> {error=No such file or directory}]
> [2024-01-18 06:08:57.594051 +0000] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
> remote operation failed.
> [{path=<gfid:3e9b178c-ae1f-4d85-ae47-fc539d94dd11>},
> {gfid=3e9b178c-ae1f-4d85-ae47-fc539d94dd11}, {errno=2},
> {error=No such file or directory}]
>
> About 7K today. Any ideas? Someone?
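>
> To check whether those gfids still resolve to real files, one can use an
> aux-gfid mount as described in the Gluster docs (a sketch - server name,
> mountpoint and the gfid below are just taken from my setup and the log):
>
> mount -t glusterfs -o aux-gfid-mount glusterpub1:/workdata /mnt/gfid
> getfattr -n trusted.glusterfs.pathinfo -e text \
>     /mnt/gfid/.gfid/cb39a1e4-2a4c-4727-861d-3ed9ef00681b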
>
>
> Best regards,
> Hubert
>
> On Wed, 17 Jan 2024 at 11:24, Hu Bert <revirii at googlemail.com> wrote:
>>
>> ok, finally managed to get all servers, volumes etc. running, but it took
>> a couple of restarts, cksum checks etc.
>>
>> One problem: a volume doesn't heal automatically, or doesn't heal at all.
>>
>> gluster volume status
>> Status of volume: workdata
>> Gluster process                            TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick glusterpub1:/gluster/md3/workdata    58832     0          Y       3436
>> Brick glusterpub2:/gluster/md3/workdata    59315     0          Y       1526
>> Brick glusterpub3:/gluster/md3/workdata    56917     0          Y       1952
>> Brick glusterpub1:/gluster/md4/workdata    59688     0          Y       3755
>> Brick glusterpub2:/gluster/md4/workdata    60271     0          Y       2271
>> Brick glusterpub3:/gluster/md4/workdata    49461     0          Y       2399
>> Brick glusterpub1:/gluster/md5/workdata    54651     0          Y       4208
>> Brick glusterpub2:/gluster/md5/workdata    49685     0          Y       2751
>> Brick glusterpub3:/gluster/md5/workdata    59202     0          Y       2803
>> Brick glusterpub1:/gluster/md6/workdata    55829     0          Y       4583
>> Brick glusterpub2:/gluster/md6/workdata    50455     0          Y       3296
>> Brick glusterpub3:/gluster/md6/workdata    50262     0          Y       3237
>> Brick glusterpub1:/gluster/md7/workdata    52238     0          Y       5014
>> Brick glusterpub2:/gluster/md7/workdata    52474     0          Y       3673
>> Brick glusterpub3:/gluster/md7/workdata    57966     0          Y       3653
>> Self-heal Daemon on localhost              N/A       N/A        Y       4141
>> Self-heal Daemon on glusterpub1            N/A       N/A        Y       5570
>> Self-heal Daemon on glusterpub2            N/A       N/A        Y       4139
>>
>> "gluster volume heal workdata info" lists a lot of files per
brick.
>> "gluster volume heal workdata statistics heal-count" shows
thousands
>> of files per brick.
>> "gluster volume heal workdata enable" has no effect.
>>
>> gluster volume heal workdata full
>> Launching heal operation to perform full self heal on volume workdata
>> has been successful
>> Use heal info commands to check status.
>>
>> -> not doing anything at all. And nothing is happening on the two "good"
>> servers in e.g. glustershd.log. Heal was working as expected on
>> version 10.4, but here... silence. Does someone have an idea?
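>>
>> Next things I plan to try (a sketch; "start force" reportedly also
>> respawns the self-heal daemon without touching the running bricks):
>>
>> gluster volume get workdata cluster.self-heal-daemon
>> gluster volume start workdata force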
>>
>>
>> Best regards,
>> Hubert
>>
>> On Tue, 16 Jan 2024 at 13:44, Gilberto Ferreira
>> <gilberto.nunes32 at gmail.com> wrote:
>>>
>>> Ah! Indeed! You need to perform an upgrade in the clients as well.
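>>>
>>> Roughly like this (a sketch - check max-op-version first; 110000 below
>>> is just the value matching release 11, not something I've verified on
>>> your cluster):
>>>
>>> gluster volume get all cluster.op-version
>>> gluster volume get all cluster.max-op-version
>>> gluster volume set all cluster.op-version 110000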
>>>
>>> On Tue, 16 Jan 2024 at 03:12, Hu Bert <revirii at googlemail.com> wrote:
>>>>
>>>> morning to those still reading :-)
>>>>
>>>> I found this:
>>>> https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them
>>>>
>>>> there's a paragraph about "peer rejected" with the same error message,
>>>> telling me: "Update the cluster.op-version" - I had only updated the
>>>> server nodes, but not the clients, so upgrading the cluster.op-version
>>>> wasn't possible at that time. So... upgrading the clients to version
>>>> 11.1 and then the op-version should solve the problem?
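>>>>
>>>> In the meantime, to see which clients are still connecting with the
>>>> old version (a sketch - I'm not sure every release prints the
>>>> op-version in this output):
>>>>
>>>> gluster volume status workdata clients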
>>>>
>>>>
>>>> Thx,
>>>> Hubert
>>>>
>>>> On Mon, 15 Jan 2024 at 09:16, Hu Bert <revirii at googlemail.com> wrote:
>>>>>
>>>>> Hi,
>>>>> just upgraded some Gluster servers from version 10.4 to version 11.1
>>>>> (Debian bullseye & bookworm). When only installing the packages: good,
>>>>> servers, volumes etc. work as expected.
>>>>>
>>>>> But one needs to test whether the systems still work after a daemon
>>>>> and/or server restart. Well, I did a reboot, and after that the
>>>>> rebooted/restarted system is "out". Log message from a working node:
>>>>>
>>>>> [2024-01-15 08:02:21.585694 +0000] I [MSGID: 106163]
>>>>> [glusterd-handshake.c:1501:__glusterd_mgmt_hndsk_versions_ack]
>>>>> 0-management: using the op-version 100000
>>>>> [2024-01-15 08:02:21.589601 +0000] I [MSGID: 106490]
>>>>> [glusterd-handler.c:2546:__glusterd_handle_incoming_friend_req]
>>>>> 0-glusterd: Received probe from uuid: b71401c3-512a-47cb-ac18-473c4ba7776e
>>>>> [2024-01-15 08:02:23.608349 +0000] E [MSGID: 106010]
>>>>> [glusterd-utils.c:3824:glusterd_compare_friend_volume] 0-management:
>>>>> Version of Cksums sourceimages differ. local cksum = 2204642525,
>>>>> remote cksum = 1931483801 on peer gluster190
>>>>> [2024-01-15 08:02:23.608584 +0000] I [MSGID: 106493]
>>>>> [glusterd-handler.c:3819:glusterd_xfer_friend_add_resp] 0-glusterd:
>>>>> Responded to gluster190 (0), ret: 0, op_ret: -1
>>>>> [2024-01-15 08:02:23.613553 +0000] I [MSGID: 106493]
>>>>> [glusterd-rpc-ops.c:467:__glusterd_friend_add_cbk] 0-glusterd:
>>>>> Received RJT from uuid: b71401c3-512a-47cb-ac18-473c4ba7776e, host:
>>>>> gluster190, port: 0
>>>>>
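>>>>> The cksum that glusterd compares there is kept on disk, so it can be
>>>>> checked by hand (assuming the usual /var/lib/glusterd layout; run on
>>>>> every node and compare - the values must match):
>>>>>
>>>>> cat /var/lib/glusterd/vols/sourceimages/cksum
>>>>>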
>>>>> peer status from rebooted node:
>>>>>
>>>>> root at gluster190 ~ # gluster peer status
>>>>> Number of Peers: 2
>>>>>
>>>>> Hostname: gluster189
>>>>> Uuid: 50dc8288-aa49-4ea8-9c6c-9a9a926c67a7
>>>>> State: Peer Rejected (Connected)
>>>>>
>>>>> Hostname: gluster188
>>>>> Uuid: e15a33fe-e2f7-47cf-ac53-a3b34136555d
>>>>> State: Peer Rejected (Connected)
>>>>>
>>>>> So the rebooted gluster190 is not accepted anymore, and thus does not
>>>>> appear in "gluster volume status". I then followed this guide:
>>>>>
>>>>> https://gluster-documentations.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/
>>>>>
>>>>> Remove everything under /var/lib/glusterd/ (except glusterd.info) and
>>>>> restart the glusterd service etc. Data gets copied from the other nodes,
>>>>> 'gluster peer status' is ok again - but the volume info is missing:
>>>>> /var/lib/glusterd/vols is empty. When syncing this dir from another
>>>>> node, the volume is available again and heals start etc.
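>>>>>
>>>>> Condensed, the steps were roughly (a sketch of what I did, following
>>>>> the guide above - gluster189 being one of the healthy peers):
>>>>>
>>>>> systemctl stop glusterd
>>>>> # wipe the local state but keep the node's identity
>>>>> find /var/lib/glusterd -mindepth 1 -maxdepth 1 \
>>>>>     ! -name glusterd.info -exec rm -rf {} +
>>>>> systemctl start glusterd   # peers sync, 'gluster peer status' ok again
>>>>> # vols/ stayed empty, so copy it from a good node and restart:
>>>>> rsync -a gluster189:/var/lib/glusterd/vols/ /var/lib/glusterd/vols/
>>>>> systemctl restart glusterd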
>>>>>
>>>>> Well, and just to be sure that everything's working as it should, I
>>>>> rebooted that node again - the rebooted node is kicked out again, and
>>>>> you have to go through bringing it back all over again.
>>>>>
>>>>> Sorry, but did I miss anything? Has someone experienced similar
>>>>> problems? I'll probably downgrade to 10.4 again; that version was
>>>>> working...
>>>>>
>>>>>
>>>>> Thx,
>>>>> Hubert
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786