Strahil Nikolov
2020-May-01 08:07 UTC
[Gluster-users] Upgrade from 5.13 to 7.5 full of weird messages
On May 1, 2020 1:25:17 AM GMT+03:00, Artem Russakovskii <archon810 at gmail.com> wrote:
>If more time is needed to analyze this, is this an option? Shut down 7.5,
>downgrade it back to 5.13, and restart — or would this screw something up
>badly? I didn't up the op-version yet.
>
>Thanks.
>
>Sincerely,
>Artem
>
>--
>Founder, Android Police <http://www.androidpolice.com>, APK Mirror
><http://www.apkmirror.com/>, Illogical Robot LLC
>beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>
>On Thu, Apr 30, 2020 at 3:13 PM Artem Russakovskii <archon810 at gmail.com> wrote:
>
>> The number of heals pending on citadel, the one that was upgraded to 7.5,
>> has now gone to 10s of thousands and continues to go up.
>>
>> On Thu, Apr 30, 2020 at 2:57 PM Artem Russakovskii <archon810 at gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Today, I decided to upgrade one of the four servers (citadel) we have
>>> to 7.5 from 5.13. There are 2 volumes, 1x4 replicate, and fuse mounts
>>> (I sent the full details earlier in another message). If everything
>>> looked OK, I would have proceeded with the rolling upgrade for all of
>>> them, followed by a full heal.
>>>
>>> However, as soon as I upgraded and restarted, the logs filled with
>>> messages like these:
>>>
>>> [2020-04-30 21:39:21.316149] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> [2020-04-30 21:39:21.382891] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> [2020-04-30 21:39:21.442440] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> [2020-04-30 21:39:21.445587] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> [2020-04-30 21:39:21.571398] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> [2020-04-30 21:39:21.668192] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>>
>>> The message "I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-androidpolice_data3-replicate-0: selecting local read_child androidpolice_data3-client-3" repeated 10 times between [2020-04-30 21:46:41.854675] and [2020-04-30 21:48:20.206323]
>>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-1: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129567] and [2020-04-30 21:48:29.905008]
>>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-0: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129602] and [2020-04-30 21:48:29.905040]
>>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-2: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129512] and [2020-04-30 21:48:29.905047]
>>>
>>> Once in a while, I'm seeing this:
>>> ==> bricks/mnt-hive_block4-androidpolice_data3.log <==
>>> [2020-04-30 21:45:54.251637] I [MSGID: 115072] [server-rpc-fops_v2.c:1681:server4_setattr_cbk] 0-androidpolice_data3-server: 5725811: SETATTR /androidpolice.com/public/wp-content/uploads/2019/03/cielo-breez-plus-hero.png (d4556eb4-f15b-412c-a42a-32b4438af557), client: CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-2-RECON_NO:-1, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>> [2020-04-30 21:49:10.439701] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 201833: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>> [2020-04-30 21:49:10.453724] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 201842: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>> [2020-04-30 21:49:16.224662] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 202865: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>>
>>> There's also lots of self-healing happening that I didn't expect at all,
>>> since the upgrade only took ~10-15s.
>>> [2020-04-30 21:47:38.714448] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-apkmirror_data1-replicate-0: performing metadata selfheal on 4a6ba2d7-7ad8-4113-862b-02e4934a3461
>>> [2020-04-30 21:47:38.765033] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-apkmirror_data1-replicate-0: Completed metadata selfheal on 4a6ba2d7-7ad8-4113-862b-02e4934a3461. sources=[3] sinks=0 1 2
>>> [2020-04-30 21:47:38.765289] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-apkmirror_data1-replicate-0: performing metadata selfheal on f3c62a41-1864-4e75-9883-4357a7091296
>>> [2020-04-30 21:47:38.800987] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-apkmirror_data1-replicate-0: Completed metadata selfheal on f3c62a41-1864-4e75-9883-4357a7091296. sources=[3] sinks=0 1 2
>>>
>>> I'm also seeing "remote operation failed" and "writing to fuse device
>>> failed: No such file or directory" messages:
>>> [2020-04-30 21:46:34.891957] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-androidpolice_data3-replicate-0: Completed metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3
>>> [2020-04-30 21:45:36.127412] W [MSGID: 114031] [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] 0-androidpolice_data3-client-0: remote operation failed [Operation not permitted]
>>> [2020-04-30 21:45:36.345924] W [MSGID: 114031] [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] 0-androidpolice_data3-client-1: remote operation failed [Operation not permitted]
>>> [2020-04-30 21:46:35.291853] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-androidpolice_data3-replicate-0: selecting local read_child androidpolice_data3-client-2
>>> [2020-04-30 21:46:35.977342] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-androidpolice_data3-replicate-0: performing metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591
>>> [2020-04-30 21:46:36.006607] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-androidpolice_data3-replicate-0: Completed metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3
>>> [2020-04-30 21:46:37.245599] E [fuse-bridge.c:219:check_and_dump_fuse_W] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
>>> [2020-04-30 21:46:50.864797] E [fuse-bridge.c:219:check_and_dump_fuse_W] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
>>>
>>> The number of items being healed is going up and down wildly, from 0 to
>>> 8000+, and it sometimes takes a really long time to return a value. I'm
>>> really worried, as this is a production system, and I didn't observe
>>> this in our test system.
>>>
>>> gluster v heal apkmirror_data1 info summary
>>> Brick nexus2:/mnt/nexus2_block1/apkmirror_data1
>>> Status: Connected
>>> Total Number of entries: 27
>>> Number of entries in heal pending: 27
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> Brick forge:/mnt/forge_block1/apkmirror_data1
>>> Status: Connected
>>> Total Number of entries: 27
>>> Number of entries in heal pending: 27
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> Brick hive:/mnt/hive_block1/apkmirror_data1
>>> Status: Connected
>>> Total Number of entries: 27
>>> Number of entries in heal pending: 27
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> Brick citadel:/mnt/citadel_block1/apkmirror_data1
>>> Status: Connected
>>> Total Number of entries: 8540
>>> Number of entries in heal pending: 8540
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> gluster v heal androidpolice_data3 info summary
>>> Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
>>> Status: Connected
>>> Total Number of entries: 1
>>> Number of entries in heal pending: 1
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> Brick forge:/mnt/forge_block4/androidpolice_data3
>>> Status: Connected
>>> Total Number of entries: 1
>>> Number of entries in heal pending: 1
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> Brick hive:/mnt/hive_block4/androidpolice_data3
>>> Status: Connected
>>> Total Number of entries: 1
>>> Number of entries in heal pending: 1
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> Brick citadel:/mnt/citadel_block4/androidpolice_data3
>>> Status: Connected
>>> Total Number of entries: 1149
>>> Number of entries in heal pending: 1149
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>> What should I do at this point? The files I tested seem to be replicating
>>> correctly, but I don't know if that's the case for all of them, and the
>>> heal counts going up and down, plus all these log messages, are making me
>>> very nervous.
>>>
>>> Thank you.
>>>
>>> Sincerely,
>>> Artem

It's not supported, but it usually works. In the worst-case scenario, you can remove the node, wipe gluster on it, reinstall the packages, and add it back. That will require a full heal of the brick and, as you have previously reported, could lead to performance degradation.

I think you are on SLES, but I could be wrong. Do you have btrfs or LVM snapshots to revert from?

Best Regards,
Strahil Nikolov
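Since `gluster v heal <volume> info summary` can be slow to return and its per-brick output is verbose, a small filter can condense it to one line per brick for repeated polling. This is a hypothetical helper, not something from the thread; it only assumes the output format quoted above, which is fed in here via a heredoc sample (in real use you would pipe the live command's output in):

```shell
#!/bin/sh
# Hypothetical helper: condense "gluster v heal <volume> info summary"
# output into one line per brick. Real use:
#   gluster v heal apkmirror_data1 info summary | summarize
summarize() {
  awk '
    /^Brick /          { brick = $2 }          # remember the current brick
    /heal pending/     { pending = $NF }       # "Number of entries in heal pending: N"
    /split-brain/      { sb = $NF }            # "Number of entries in split-brain: N"
    /possibly healing/ {                       # last field of each brick block: print it
      printf "%s pending=%s split-brain=%s healing=%s\n", brick, pending, sb, $NF
    }
  '
}

# Sample input taken from the heal summary quoted in the thread:
summarize <<'EOF'
Brick hive:/mnt/hive_block1/apkmirror_data1
Status: Connected
Total Number of entries: 27
Number of entries in heal pending: 27
Number of entries in split-brain: 0
Number of entries possibly healing: 0
Brick citadel:/mnt/citadel_block1/apkmirror_data1
Status: Connected
Total Number of entries: 8540
Number of entries in heal pending: 8540
Number of entries in split-brain: 0
Number of entries possibly healing: 0
EOF
```

Wrapped in `watch -n 60`, this makes the "going up and down wildly" behavior much easier to track over time than rerunning the full command by hand.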
Artem Russakovskii
2020-May-01 14:25 UTC
[Gluster-users] Upgrade from 5.13 to 7.5 full of weird messages
I do not have snapshots, no. I have a general file-based backup, and the other 3 nodes are also up. It's OpenSUSE 15.1.

If I try to downgrade and it doesn't work, what's the brick replacement scenario — is this still accurate?
https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-brick

Any feedback about the issues themselves yet, please? Specifically, is there a chance this is happening because of the mismatched gluster versions? Though, what's the solution then?

On Fri, May 1, 2020, 1:07 AM Strahil Nikolov <hunter86_bg at yahoo.com> wrote:

> It's not supported, but it usually works.
>
> In the worst-case scenario, you can remove the node, wipe gluster on it,
> reinstall the packages, and add it back. That will require a full heal of
> the brick and, as you have previously reported, could lead to performance
> degradation.
>
> I think you are on SLES, but I could be wrong. Do you have btrfs or LVM
> snapshots to revert from?
>
> Best Regards,
> Strahil Nikolov
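For reference, the replace-brick procedure in the doc linked above comes down to a couple of CLI calls. The sketch below is a hypothetical dry run that only prints the commands rather than executing them; the volume and old-brick names are taken from this thread, while the new brick path is a made-up example, and the exact invocation should be checked against the docs for your Gluster version:

```shell
#!/bin/sh
# Hypothetical dry-run sketch of the replace-brick flow from the Gluster
# "Managing Volumes" doc linked above. It only PRINTS the commands; the
# caller decides whether to actually run them.
replace_brick_cmds() {
  volume=$1; old_brick=$2; new_brick=$3
  # "commit force" swaps the brick in place and lets self-heal repopulate it
  echo "gluster volume replace-brick $volume $old_brick $new_brick commit force"
  # Then watch the heal catch up on the new brick:
  echo "gluster volume heal $volume info summary"
}

# Volume and old brick are from the thread; the *_new path is hypothetical.
replace_brick_cmds apkmirror_data1 \
  citadel:/mnt/citadel_block1/apkmirror_data1 \
  citadel:/mnt/citadel_block1_new/apkmirror_data1
```

Note that a full heal of a replaced brick is exactly the scenario Strahil warns may degrade performance, so on a production cluster this is the fallback, not the first resort.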