Artem Russakovskii
2020-May-01 17:03 UTC
[Gluster-users] Upgrade from 5.13 to 7.5 full of weird messages
The good news is the downgrade seems to have worked and was painless:
zypper install --oldpackage glusterfs-5.13, restart gluster, and almost
immediately there are no heal pending entries anymore.
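
For reference, the full sequence on the downgraded node was roughly the
following (a sketch - the glusterd systemd unit name assumes the openSUSE
packaging, and a downgrade is only an option because cluster.op-version was
never bumped to the 7.x value):

  # confirm the cluster op-version was not raised after the upgrade
  gluster volume get all cluster.op-version

  systemctl stop glusterd
  zypper install --oldpackage glusterfs-5.13
  systemctl start glusterd

  # watch the heal-pending counts drain back to zero
  gluster volume heal apkmirror_data1 info summary
  gluster volume heal androidpolice_data3 info summary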

The only thing still showing up in the logs, besides some healing, is
"0-glusterfs-fuse: writing to fuse device failed: No such file or directory":

==> mnt-androidpolice_data3.log <==
[2020-05-01 16:54:21.085643] E [fuse-bridge.c:219:check_and_dump_fuse_W]
(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d]
(--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a]
(--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb]
(--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9]
(--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] )))))
0-glusterfs-fuse: writing to fuse device failed: No such file or directory

==> mnt-apkmirror_data1.log <==
[2020-05-01 16:54:21.268842] E [fuse-bridge.c:219:check_and_dump_fuse_W]
(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fdf2b0a624d]
(--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fdf2748949a]
(--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fdf274897bb]
(--> /lib64/libpthread.so.0(+0x84f9)[0x7fdf2a5f64f9]
(--> /lib64/libc.so.6(clone+0x3f)[0x7fdf2a32ef2f] )))))
0-glusterfs-fuse: writing to fuse device failed: No such file or directory

It'd be very helpful if the message included more info about what failed to
write and why.

I'd still really love to see an analysis of this failed upgrade from the
core gluster maintainers, to see what needs fixing and how we can upgrade
safely in the future.

Thanks.

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror
<http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | @ArtemR <http://twitter.com/ArtemR>


On Fri, May 1, 2020 at 7:25 AM Artem Russakovskii <archon810 at gmail.com> wrote:

> I do not have snapshots, no. I have a general file-based backup, but the
> other 3 nodes are also up.
>
> OpenSUSE 15.1.
>
> If I try to downgrade and it doesn't work, what's the brick replacement
> scenario - is this still accurate?
> https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-brick
>
> Any feedback on the issues themselves yet, please? Specifically, is there
> a chance this is happening because of the mismatched gluster versions?
> Though, what's the solution then?
>
> On Fri, May 1, 2020, 1:07 AM Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
>
>> On May 1, 2020 1:25:17 AM GMT+03:00, Artem Russakovskii
>> <archon810 at gmail.com> wrote:
>>> If more time is needed to analyze this, is this an option? Shut down
>>> 7.5, downgrade it back to 5.13 and restart, or would this screw
>>> something up badly? I didn't up the op-version yet.
>>>
>>> Thanks.
>>>
>>> Sincerely,
>>> Artem
>>>
>>> --
>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>>>
>>> On Thu, Apr 30, 2020 at 3:13 PM Artem Russakovskii
>>> <archon810 at gmail.com> wrote:
>>>
>>>> The number of heal pending entries on citadel, the one that was
>>>> upgraded to 7.5, has now gone to tens of thousands and continues to
>>>> go up.
>>>>
>>>> Sincerely,
>>>> Artem
>>>>
>>>> --
>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>>> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>>>>
>>>> On Thu, Apr 30, 2020 at 2:57 PM Artem Russakovskii
>>>> <archon810 at gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Today, I decided to upgrade one of the four servers (citadel) we
>>>>> have from 5.13 to 7.5. There are 2 volumes, 1x4 replicate, and fuse
>>>>> mounts (I sent the full details earlier in another message). If
>>>>> everything looked OK, I would have proceeded with the rolling
>>>>> upgrade of all of them, followed by a full heal.
>>>>>
>>>>> However, as soon as I upgraded and restarted, the logs filled with
>>>>> messages like these:
>>>>>
>>>>> [2020-04-30 21:39:21.316149] E [rpcsvc.c:567:rpcsvc_check_and_reply_error]
>>>>> 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>>>> [2020-04-30 21:39:21.382891] E [rpcsvc.c:567:rpcsvc_check_and_reply_error]
>>>>> 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>>>> [2020-04-30 21:39:21.442440] E [rpcsvc.c:567:rpcsvc_check_and_reply_error]
>>>>> 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>>>> [2020-04-30 21:39:21.445587] E [rpcsvc.c:567:rpcsvc_check_and_reply_error]
>>>>> 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>>>> [2020-04-30 21:39:21.571398] E [rpcsvc.c:567:rpcsvc_check_and_reply_error]
>>>>> 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>>>> [2020-04-30 21:39:21.668192] E [rpcsvc.c:567:rpcsvc_check_and_reply_error]
>>>>> 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>>>>
>>>>> The message "I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk]
>>>>> 0-androidpolice_data3-replicate-0: selecting local read_child
>>>>> androidpolice_data3-client-3" repeated 10 times between
>>>>> [2020-04-30 21:46:41.854675] and [2020-04-30 21:48:20.206323]
>>>>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk]
>>>>> 0-androidpolice_data3-client-1: remote operation failed [Transport endpoint
>>>>> is not connected]" repeated 264 times between [2020-04-30 21:46:32.129567]
>>>>> and [2020-04-30 21:48:29.905008]
>>>>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk]
>>>>> 0-androidpolice_data3-client-0: remote operation failed [Transport endpoint
>>>>> is not connected]" repeated 264 times between [2020-04-30 21:46:32.129602]
>>>>> and [2020-04-30 21:48:29.905040]
>>>>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk]
>>>>> 0-androidpolice_data3-client-2: remote operation failed [Transport endpoint
>>>>> is not connected]" repeated 264 times between [2020-04-30 21:46:32.129512]
>>>>> and [2020-04-30 21:48:29.905047]
>>>>>
>>>>> Once in a while, I'm seeing this:
>>>>>
>>>>> ==> bricks/mnt-hive_block4-androidpolice_data3.log <==
>>>>> [2020-04-30 21:45:54.251637] I [MSGID: 115072]
>>>>> [server-rpc-fops_v2.c:1681:server4_setattr_cbk]
>>>>> 0-androidpolice_data3-server: 5725811: SETATTR /
>>>>> androidpolice.com/public/wp-content/uploads/2019/03/cielo-breez-plus-hero.png
>>>>> (d4556eb4-f15b-412c-a42a-32b4438af557), client:
>>>>> CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-2-RECON_NO:-1,
>>>>> error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>>>> [2020-04-30 21:49:10.439701] I [MSGID: 115072]
>>>>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk]
>>>>> 0-androidpolice_data3-server: 201833: SETATTR /
>>>>> androidpolice.com/public/wp-content/uploads
>>>>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client:
>>>>> CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2,
>>>>> error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>>>> [2020-04-30 21:49:10.453724] I [MSGID: 115072]
>>>>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk]
>>>>> 0-androidpolice_data3-server: 201842: SETATTR /
>>>>> androidpolice.com/public/wp-content/uploads
>>>>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client:
>>>>> CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2,
>>>>> error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>>>> [2020-04-30 21:49:16.224662] I [MSGID: 115072]
>>>>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk]
>>>>> 0-androidpolice_data3-server: 202865: SETATTR /
>>>>> androidpolice.com/public/wp-content/uploads
>>>>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client:
>>>>> CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2,
>>>>> error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>>>>
>>>>> There's also lots of self-healing happening that I didn't expect at
>>>>> all, since the upgrade only took ~10-15s:
>>>>>
>>>>> [2020-04-30 21:47:38.714448] I [MSGID: 108026]
>>>>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
>>>>> 0-apkmirror_data1-replicate-0: performing metadata selfheal on
>>>>> 4a6ba2d7-7ad8-4113-862b-02e4934a3461
>>>>> [2020-04-30 21:47:38.765033] I [MSGID: 108026]
>>>>> [afr-self-heal-common.c:1723:afr_log_selfheal]
>>>>> 0-apkmirror_data1-replicate-0: Completed metadata selfheal on
>>>>> 4a6ba2d7-7ad8-4113-862b-02e4934a3461. sources=[3] sinks=0 1 2
>>>>> [2020-04-30 21:47:38.765289] I [MSGID: 108026]
>>>>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
>>>>> 0-apkmirror_data1-replicate-0: performing metadata selfheal on
>>>>> f3c62a41-1864-4e75-9883-4357a7091296
>>>>> [2020-04-30 21:47:38.800987] I [MSGID: 108026]
>>>>> [afr-self-heal-common.c:1723:afr_log_selfheal]
>>>>> 0-apkmirror_data1-replicate-0: Completed metadata selfheal on
>>>>> f3c62a41-1864-4e75-9883-4357a7091296. sources=[3] sinks=0 1 2
>>>>>
>>>>> I'm also seeing "remote operation failed" and "writing to fuse device
>>>>> failed: No such file or directory" messages:
>>>>>
>>>>> [2020-04-30 21:46:34.891957] I [MSGID: 108026]
>>>>> [afr-self-heal-common.c:1723:afr_log_selfheal]
>>>>> 0-androidpolice_data3-replicate-0: Completed metadata selfheal on
>>>>> 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3
>>>>> [2020-04-30 21:45:36.127412] W [MSGID: 114031]
>>>>> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk]
>>>>> 0-androidpolice_data3-client-0: remote operation failed [Operation not
>>>>> permitted]
>>>>> [2020-04-30 21:45:36.345924] W [MSGID: 114031]
>>>>> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk]
>>>>> 0-androidpolice_data3-client-1: remote operation failed [Operation not
>>>>> permitted]
>>>>> [2020-04-30 21:46:35.291853] I [MSGID: 108031]
>>>>> [afr-common.c:2543:afr_local_discovery_cbk]
>>>>> 0-androidpolice_data3-replicate-0: selecting local read_child
>>>>> androidpolice_data3-client-2
>>>>> [2020-04-30 21:46:35.977342] I [MSGID: 108026]
>>>>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
>>>>> 0-androidpolice_data3-replicate-0: performing metadata selfheal on
>>>>> 2692eeba-1ebe-49b6-927f-1dfbcd227591
>>>>> [2020-04-30 21:46:36.006607] I [MSGID: 108026]
>>>>> [afr-self-heal-common.c:1723:afr_log_selfheal]
>>>>> 0-androidpolice_data3-replicate-0: Completed metadata selfheal on
>>>>> 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3
>>>>> [2020-04-30 21:46:37.245599] E [fuse-bridge.c:219:check_and_dump_fuse_W]
>>>>> (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d]
>>>>> (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a]
>>>>> (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb]
>>>>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9]
>>>>> (--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] )))))
>>>>> 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
>>>>> [2020-04-30 21:46:50.864797] E [fuse-bridge.c:219:check_and_dump_fuse_W]
>>>>> (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d]
>>>>> (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a]
>>>>> (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb]
>>>>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9]
>>>>> (--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] )))))
>>>>> 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
>>>>>
>>>>> The number of items being healed is going up and down wildly, from 0
>>>>> to 8000+, and the command sometimes takes a really long time to return
>>>>> a value. I'm really worried, as this is a production system, and I
>>>>> didn't observe this in our test system.
>>>>>
>>>>> gluster v heal apkmirror_data1 info summary
>>>>>
>>>>> Brick nexus2:/mnt/nexus2_block1/apkmirror_data1
>>>>> Status: Connected
>>>>> Total Number of entries: 27
>>>>> Number of entries in heal pending: 27
>>>>> Number of entries in split-brain: 0
>>>>> Number of entries possibly healing: 0
>>>>>
>>>>> Brick forge:/mnt/forge_block1/apkmirror_data1
>>>>> Status: Connected
>>>>> Total Number of entries: 27
>>>>> Number of entries in heal pending: 27
>>>>> Number of entries in split-brain: 0
>>>>> Number of entries possibly healing: 0
>>>>>
>>>>> Brick hive:/mnt/hive_block1/apkmirror_data1
>>>>> Status: Connected
>>>>> Total Number of entries: 27
>>>>> Number of entries in heal pending: 27
>>>>> Number of entries in split-brain: 0
>>>>> Number of entries possibly healing: 0
>>>>>
>>>>> Brick citadel:/mnt/citadel_block1/apkmirror_data1
>>>>> Status: Connected
>>>>> Total Number of entries: 8540
>>>>> Number of entries in heal pending: 8540
>>>>> Number of entries in split-brain: 0
>>>>> Number of entries possibly healing: 0
>>>>>
>>>>> gluster v heal androidpolice_data3 info summary
>>>>>
>>>>> Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
>>>>> Status: Connected
>>>>> Total Number of entries: 1
>>>>> Number of entries in heal pending: 1
>>>>> Number of entries in split-brain: 0
>>>>> Number of entries possibly healing: 0
>>>>>
>>>>> Brick forge:/mnt/forge_block4/androidpolice_data3
>>>>> Status: Connected
>>>>> Total Number of entries: 1
>>>>> Number of entries in heal pending: 1
>>>>> Number of entries in split-brain: 0
>>>>> Number of entries possibly healing: 0
>>>>>
>>>>> Brick hive:/mnt/hive_block4/androidpolice_data3
>>>>> Status: Connected
>>>>> Total Number of entries: 1
>>>>> Number of entries in heal pending: 1
>>>>> Number of entries in split-brain: 0
>>>>> Number of entries possibly healing: 0
>>>>>
>>>>> Brick citadel:/mnt/citadel_block4/androidpolice_data3
>>>>> Status: Connected
>>>>> Total Number of entries: 1149
>>>>> Number of entries in heal pending: 1149
>>>>> Number of entries in split-brain: 0
>>>>> Number of entries possibly healing: 0
>>>>>
>>>>> What should I do at this point? The files I tested seem to be
>>>>> replicating correctly, but I don't know whether that's the case for
>>>>> all of them, and the heal counts going up and down, plus all these log
>>>>> messages, are making me very nervous.
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Sincerely,
>>>>> Artem
>>>>>
>>>>> --
>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>>>> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>>
>> It's not supported, but usually it works.
>>
>> In the worst-case scenario, you can remove the node, wipe gluster on it,
>> reinstall the packages, and add it back - that will require a full heal
>> of the brick and, as you have previously reported, could lead to
>> performance degradation.
>>
>> I think you are on SLES, but I could be wrong. Do you have btrfs or LVM
>> snapshots to revert from?
>>
>> Best Regards,
>> Strahil Nikolov
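
The worst-case brick replacement discussed in the quoted exchange maps onto
the documented replace-brick flow - a sketch, in which the new brick path on
citadel is hypothetical and must be a fresh, empty directory:

  # swap the old brick for a new, empty one on the same node
  gluster volume replace-brick apkmirror_data1 \
      citadel:/mnt/citadel_block1/apkmirror_data1 \
      citadel:/mnt/citadel_block1_new/apkmirror_data1 \
      commit force

  # self-heal then repopulates the new brick; track its progress with:
  gluster volume heal apkmirror_data1 info summary
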
Strahil Nikolov
2020-May-02 07:47 UTC
[Gluster-users] Upgrade from 5.13 to 7.5 full of weird messages
On May 1, 2020 8:03:50 PM GMT+03:00, Artem Russakovskii <archon810 at gmail.com> wrote:
>The good news is the downgrade seems to have worked and was painless:
>zypper install --oldpackage glusterfs-5.13, restart gluster, and almost
>immediately there are no heal pending entries anymore.
>
>[...]
>
>I'd still really love to see an analysis of this failed upgrade from the
>core gluster maintainers, to see what needs fixing and how we can upgrade
>safely in the future.

Hi Artem,

You can increase the brick log level by following
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level
but keep in mind that the logs grow quite fast - so don't keep them above
the current level for too long.

Do you have geo-replication running?

About the migration issue - I have no clue why it happened. The last time I
skipped a major release (3.12 to 5.5), I ran into huge trouble (all file
ownership was switched to root), and I have a feeling it won't happen again
if you go through v6 first.

Best Regards,
Strahil Nikolov
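
The log-level change suggested above maps onto per-volume diagnostics
options - a sketch, using one of the volume names from this thread (DEBUG
output grows fast, so revert once the logs are captured):

  # raise brick-side verbosity while reproducing the problem
  gluster volume set androidpolice_data3 diagnostics.brick-log-level DEBUG

  # the client/fuse side has a matching option
  gluster volume set androidpolice_data3 diagnostics.client-log-level DEBUG

  # revert both to the INFO default afterwards
  gluster volume reset androidpolice_data3 diagnostics.brick-log-level
  gluster volume reset androidpolice_data3 diagnostics.client-log-level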