Strahil Nikolov
2020-May-01 08:07 UTC
[Gluster-users] Upgrade from 5.13 to 7.5 full of weird messages
On May 1, 2020 1:25:17 AM GMT+03:00, Artem Russakovskii <archon810 at gmail.com> wrote:
>If more time is needed to analyze this, is this an option? Shut down 7.5,
>downgrade it back to 5.13, and restart — or would this screw something up
>badly? I didn't up the op-version yet.
>
>Thanks.
>
>Sincerely,
>Artem
>
>--
>Founder, Android Police <http://www.androidpolice.com>, APK Mirror
><http://www.apkmirror.com/>, Illogical Robot LLC
>beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>
>On Thu, Apr 30, 2020 at 3:13 PM Artem Russakovskii <archon810 at gmail.com> wrote:
>
>> The number of heals pending on citadel, the one that was upgraded to 7.5,
>> has now gone to 10s of thousands and continues to go up.
>>
>> On Thu, Apr 30, 2020 at 2:57 PM Artem Russakovskii <archon810 at gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Today, I decided to upgrade one of the four servers (citadel) we have
>>> to 7.5 from 5.13. There are 2 volumes, 1x4 replicate, and fuse mounts
>>> (I sent the full details earlier in another message). If everything
>>> looked OK, I would have proceeded with the rolling upgrade for all of
>>> them, followed by a full heal.
>>>
>>> However, as soon as I upgraded and restarted, the logs filled with
>>> messages like these:
>>>
>>> [2020-04-30 21:39:21.316149] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> [2020-04-30 21:39:21.382891] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> [2020-04-30 21:39:21.442440] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> [2020-04-30 21:39:21.445587] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> [2020-04-30 21:39:21.571398] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> [2020-04-30 21:39:21.668192] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>>
>>> The message "I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-androidpolice_data3-replicate-0: selecting local read_child androidpolice_data3-client-3" repeated 10 times between [2020-04-30 21:46:41.854675] and [2020-04-30 21:48:20.206323]
>>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-1: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129567] and [2020-04-30 21:48:29.905008]
>>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-0: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129602] and [2020-04-30 21:48:29.905040]
>>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-2: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129512] and [2020-04-30 21:48:29.905047]
>>>
>>> Once in a while, I'm seeing this:
>>> ==> bricks/mnt-hive_block4-androidpolice_data3.log <==
>>> [2020-04-30 21:45:54.251637] I [MSGID: 115072] [server-rpc-fops_v2.c:1681:server4_setattr_cbk] 0-androidpolice_data3-server: 5725811: SETATTR /androidpolice.com/public/wp-content/uploads/2019/03/cielo-breez-plus-hero.png (d4556eb4-f15b-412c-a42a-32b4438af557), client: CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-2-RECON_NO:-1, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>> [2020-04-30 21:49:10.439701] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 201833: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>> [2020-04-30 21:49:10.453724] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 201842: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>> [2020-04-30 21:49:16.224662] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 202865: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>>
>>> There's also lots of self-healing happening that I didn't expect at all,
>>> since the upgrade only took ~10-15s.
>>> [2020-04-30 21:47:38.714448] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-apkmirror_data1-replicate-0: performing metadata selfheal on 4a6ba2d7-7ad8-4113-862b-02e4934a3461
>>> [2020-04-30 21:47:38.765033] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-apkmirror_data1-replicate-0: Completed metadata selfheal on 4a6ba2d7-7ad8-4113-862b-02e4934a3461. sources=[3] sinks=0 1 2
>>> [2020-04-30 21:47:38.765289] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-apkmirror_data1-replicate-0: performing metadata selfheal on f3c62a41-1864-4e75-9883-4357a7091296
>>> [2020-04-30 21:47:38.800987] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-apkmirror_data1-replicate-0: Completed metadata selfheal on f3c62a41-1864-4e75-9883-4357a7091296. sources=[3] sinks=0 1 2
>>>
>>> I'm also seeing "remote operation failed" and "writing to fuse device
>>> failed: No such file or directory" messages:
>>> [2020-04-30 21:46:34.891957] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-androidpolice_data3-replicate-0: Completed metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3
>>> [2020-04-30 21:45:36.127412] W [MSGID: 114031] [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] 0-androidpolice_data3-client-0: remote operation failed [Operation not permitted]
>>> [2020-04-30 21:45:36.345924] W [MSGID: 114031] [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] 0-androidpolice_data3-client-1: remote operation failed [Operation not permitted]
>>> [2020-04-30 21:46:35.291853] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-androidpolice_data3-replicate-0: selecting local read_child androidpolice_data3-client-2
>>> [2020-04-30 21:46:35.977342] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-androidpolice_data3-replicate-0: performing metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591
>>> [2020-04-30 21:46:36.006607] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-androidpolice_data3-replicate-0: Completed metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3
>>> [2020-04-30 21:46:37.245599] E [fuse-bridge.c:219:check_and_dump_fuse_W] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
>>> [2020-04-30 21:46:50.864797] E [fuse-bridge.c:219:check_and_dump_fuse_W] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
>>>
>>> The number of items being healed is going up and down wildly, from 0 to
>>> 8000+, and it sometimes takes a really long time to return a value. I'm
>>> really worried, as this is a production system, and I didn't observe
>>> this in our test system.
>>>
>>> gluster v heal apkmirror_data1 info summary
>>> Brick nexus2:/mnt/nexus2_block1/apkmirror_data1
>>> Status: Connected
>>> Total Number of entries: 27
>>> Number of entries in heal pending: 27
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> Brick forge:/mnt/forge_block1/apkmirror_data1
>>> Status: Connected
>>> Total Number of entries: 27
>>> Number of entries in heal pending: 27
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> Brick hive:/mnt/hive_block1/apkmirror_data1
>>> Status: Connected
>>> Total Number of entries: 27
>>> Number of entries in heal pending: 27
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> Brick citadel:/mnt/citadel_block1/apkmirror_data1
>>> Status: Connected
>>> Total Number of entries: 8540
>>> Number of entries in heal pending: 8540
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> gluster v heal androidpolice_data3 info summary
>>> Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
>>> Status: Connected
>>> Total Number of entries: 1
>>> Number of entries in heal pending: 1
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> Brick forge:/mnt/forge_block4/androidpolice_data3
>>> Status: Connected
>>> Total Number of entries: 1
>>> Number of entries in heal pending: 1
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> Brick hive:/mnt/hive_block4/androidpolice_data3
>>> Status: Connected
>>> Total Number of entries: 1
>>> Number of entries in heal pending: 1
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> Brick citadel:/mnt/citadel_block4/androidpolice_data3
>>> Status: Connected
>>> Total Number of entries: 1149
>>> Number of entries in heal pending: 1149
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> What should I do at this point? The files I tested seem to be replicating
>>> correctly, but I don't know if that's the case for all of them, and the
>>> heal counts going up and down, plus all these log messages, are making me
>>> very nervous.
>>>
>>> Thank you.
>>>
>>> Sincerely,
>>> Artem

It's not supported, but it usually works. In the worst-case scenario, you can remove the node, wipe gluster on it, reinstall the packages, and add it back. That will require a full heal of the brick and, as you have previously reported, could lead to performance degradation.

I think you are on SLES, but I could be wrong. Do you have btrfs or LVM snapshots to revert from?

Best Regards,
Strahil Nikolov
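Since `gluster v heal <volume> info summary` can be slow to return and its per-brick output is verbose, a small filter can condense it to one line per brick for repeated polling. This is a hypothetical helper, not something from the thread; it only assumes the output format quoted above, which is fed in here via a heredoc sample (in real use you would pipe the live command's output in):

```shell
#!/bin/sh
# Hypothetical helper: condense "gluster v heal <volume> info summary"
# output into one line per brick. Real use:
#   gluster v heal apkmirror_data1 info summary | summarize
summarize() {
  awk '
    /^Brick /          { brick = $2 }          # remember the current brick
    /heal pending/     { pending = $NF }       # "Number of entries in heal pending: N"
    /split-brain/      { sb = $NF }            # "Number of entries in split-brain: N"
    /possibly healing/ {                       # last field of each brick block: print it
      printf "%s pending=%s split-brain=%s healing=%s\n", brick, pending, sb, $NF
    }
  '
}

# Sample input taken from the heal summary quoted in the thread:
summarize <<'EOF'
Brick hive:/mnt/hive_block1/apkmirror_data1
Status: Connected
Total Number of entries: 27
Number of entries in heal pending: 27
Number of entries in split-brain: 0
Number of entries possibly healing: 0
Brick citadel:/mnt/citadel_block1/apkmirror_data1
Status: Connected
Total Number of entries: 8540
Number of entries in heal pending: 8540
Number of entries in split-brain: 0
Number of entries possibly healing: 0
EOF
```

Wrapped in `watch -n 60`, this makes the "going up and down wildly" behavior much easier to track over time than rerunning the full command by hand.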
Artem Russakovskii
2020-May-01 14:25 UTC
[Gluster-users] Upgrade from 5.13 to 7.5 full of weird messages
I do not have snapshots, no. I have a general file-based backup, and the other 3 nodes are also up. It's OpenSUSE 15.1.

If I try to downgrade and it doesn't work, what's the brick replacement scenario — is this still accurate?
https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-brick

Any feedback about the issues themselves yet, please? Specifically, is there a chance this is happening because of the mismatched gluster versions? Though, what's the solution then?

On Fri, May 1, 2020, 1:07 AM Strahil Nikolov <hunter86_bg at yahoo.com> wrote:

> It's not supported, but it usually works.
>
> In the worst-case scenario, you can remove the node, wipe gluster on it,
> reinstall the packages, and add it back. That will require a full heal of
> the brick and, as you have previously reported, could lead to performance
> degradation.
>
> I think you are on SLES, but I could be wrong. Do you have btrfs or LVM
> snapshots to revert from?
>
> Best Regards,
> Strahil Nikolov
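For reference, the replace-brick procedure in the doc linked above comes down to a couple of CLI calls. The sketch below is a hypothetical dry run that only prints the commands rather than executing them; the volume and old-brick names are taken from this thread, while the new brick path is a made-up example, and the exact invocation should be checked against the docs for your Gluster version:

```shell
#!/bin/sh
# Hypothetical dry-run sketch of the replace-brick flow from the Gluster
# "Managing Volumes" doc linked above. It only PRINTS the commands; the
# caller decides whether to actually run them.
replace_brick_cmds() {
  volume=$1; old_brick=$2; new_brick=$3
  # "commit force" swaps the brick in place and lets self-heal repopulate it
  echo "gluster volume replace-brick $volume $old_brick $new_brick commit force"
  # Then watch the heal catch up on the new brick:
  echo "gluster volume heal $volume info summary"
}

# Volume and old brick are from the thread; the *_new path is hypothetical.
replace_brick_cmds apkmirror_data1 \
  citadel:/mnt/citadel_block1/apkmirror_data1 \
  citadel:/mnt/citadel_block1_new/apkmirror_data1
```

Note that a full heal of a replaced brick is exactly the scenario Strahil warns may degrade performance, so on a production cluster this is the fallback, not the first resort.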