Artem Russakovskii
2020-Apr-30 22:13 UTC
[Gluster-users] Upgrade from 5.13 to 7.5 full of weird messages
The number of heal pending on citadel, the one that was upgraded to 7.5, has now gone to 10s of thousands and continues to go up.

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | @ArtemR <http://twitter.com/ArtemR>

On Thu, Apr 30, 2020 at 2:57 PM Artem Russakovskii <archon810 at gmail.com> wrote:
> Hi all,
>
> Today, I decided to upgrade one of the four servers (citadel) we have to
> 7.5 from 5.13. There are 2 volumes, 1x4 replicate, and fuse mounts (I sent
> the full details earlier in another message). If everything looked OK, I
> would have proceeded the rolling upgrade for all of them, following the
> full heal.
>
> However, as soon as I upgraded and restarted, the logs filled with
> messages like these:
>
> [2020-04-30 21:39:21.316149] E [rpcsvc.c:567:rpcsvc_check_and_reply_error]
> 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
> [2020-04-30 21:39:21.382891] E [rpcsvc.c:567:rpcsvc_check_and_reply_error]
> 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
> [2020-04-30 21:39:21.442440] E [rpcsvc.c:567:rpcsvc_check_and_reply_error]
> 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
> [2020-04-30 21:39:21.445587] E [rpcsvc.c:567:rpcsvc_check_and_reply_error]
> 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
> [2020-04-30 21:39:21.571398] E [rpcsvc.c:567:rpcsvc_check_and_reply_error]
> 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
> [2020-04-30 21:39:21.668192] E [rpcsvc.c:567:rpcsvc_check_and_reply_error]
> 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>
> The message "I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk]
> 0-androidpolice_data3-replicate-0: selecting local read_child
> androidpolice_data3-client-3" repeated 10 times between [2020-04-30
> 21:46:41.854675] and [2020-04-30 21:48:20.206323]
> The message "W [MSGID: 114031]
> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk]
> 0-androidpolice_data3-client-1: remote operation failed [Transport endpoint
> is not connected]" repeated 264 times between [2020-04-30 21:46:32.129567]
> and [2020-04-30 21:48:29.905008]
> The message "W [MSGID: 114031]
> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk]
> 0-androidpolice_data3-client-0: remote operation failed [Transport endpoint
> is not connected]" repeated 264 times between [2020-04-30 21:46:32.129602]
> and [2020-04-30 21:48:29.905040]
> The message "W [MSGID: 114031]
> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk]
> 0-androidpolice_data3-client-2: remote operation failed [Transport endpoint
> is not connected]" repeated 264 times between [2020-04-30 21:46:32.129512]
> and [2020-04-30 21:48:29.905047]
>
> Once in a while, I'm seeing this:
> ==> bricks/mnt-hive_block4-androidpolice_data3.log <==
> [2020-04-30 21:45:54.251637] I [MSGID: 115072]
> [server-rpc-fops_v2.c:1681:server4_setattr_cbk]
> 0-androidpolice_data3-server: 5725811: SETATTR /
> androidpolice.com/public/wp-content/uploads/2019/03/cielo-breez-plus-hero.png
> (d4556eb4-f15b-412c-a42a-32b4438af557), client:
> CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-2-RECON_NO:-1,
> error-xlator: androidpolice_data3-access-control [Operation not permitted]
> [2020-04-30 21:49:10.439701] I [MSGID: 115072]
> [server-rpc-fops_v2.c:1680:server4_setattr_cbk]
> 0-androidpolice_data3-server: 201833: SETATTR /
> androidpolice.com/public/wp-content/uploads
> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client:
> CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2,
> error-xlator: androidpolice_data3-access-control [Operation not permitted]
> [2020-04-30 21:49:10.453724] I [MSGID: 115072]
> [server-rpc-fops_v2.c:1680:server4_setattr_cbk]
> 0-androidpolice_data3-server: 201842: SETATTR /
> androidpolice.com/public/wp-content/uploads
> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client:
> CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2,
> error-xlator: androidpolice_data3-access-control [Operation not permitted]
> [2020-04-30 21:49:16.224662] I [MSGID: 115072]
> [server-rpc-fops_v2.c:1680:server4_setattr_cbk]
> 0-androidpolice_data3-server: 202865: SETATTR /
> androidpolice.com/public/wp-content/uploads
> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client:
> CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2,
> error-xlator: androidpolice_data3-access-control [Operation not permitted]
>
> There's also lots of self-healing happening that I didn't expect at all,
> since the upgrade only took ~10-15s.
> [2020-04-30 21:47:38.714448] I [MSGID: 108026]
> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
> 0-apkmirror_data1-replicate-0: performing metadata selfheal on
> 4a6ba2d7-7ad8-4113-862b-02e4934a3461
> [2020-04-30 21:47:38.765033] I [MSGID: 108026]
> [afr-self-heal-common.c:1723:afr_log_selfheal]
> 0-apkmirror_data1-replicate-0: Completed metadata selfheal on
> 4a6ba2d7-7ad8-4113-862b-02e4934a3461. sources=[3] sinks=0 1 2
> [2020-04-30 21:47:38.765289] I [MSGID: 108026]
> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
> 0-apkmirror_data1-replicate-0: performing metadata selfheal on
> f3c62a41-1864-4e75-9883-4357a7091296
> [2020-04-30 21:47:38.800987] I [MSGID: 108026]
> [afr-self-heal-common.c:1723:afr_log_selfheal]
> 0-apkmirror_data1-replicate-0: Completed metadata selfheal on
> f3c62a41-1864-4e75-9883-4357a7091296. sources=[3] sinks=0 1 2
>
> I'm also seeing "remote operation failed" and "writing to fuse device
> failed: No such file or directory" messages
> [2020-04-30 21:46:34.891957] I [MSGID: 108026]
> [afr-self-heal-common.c:1723:afr_log_selfheal]
> 0-androidpolice_data3-replicate-0: Completed metadata selfheal on
> 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3
> [2020-04-30 21:45:36.127412] W [MSGID: 114031]
> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk]
> 0-androidpolice_data3-client-0: remote operation failed [Operation not
> permitted]
> [2020-04-30 21:45:36.345924] W [MSGID: 114031]
> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk]
> 0-androidpolice_data3-client-1: remote operation failed [Operation not
> permitted]
> [2020-04-30 21:46:35.291853] I [MSGID: 108031]
> [afr-common.c:2543:afr_local_discovery_cbk]
> 0-androidpolice_data3-replicate-0: selecting local read_child
> androidpolice_data3-client-2
> [2020-04-30 21:46:35.977342] I [MSGID: 108026]
> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
> 0-androidpolice_data3-replicate-0: performing metadata selfheal on
> 2692eeba-1ebe-49b6-927f-1dfbcd227591
> [2020-04-30 21:46:36.006607] I [MSGID: 108026]
> [afr-self-heal-common.c:1723:afr_log_selfheal]
> 0-androidpolice_data3-replicate-0: Completed metadata selfheal on
> 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3
> [2020-04-30 21:46:37.245599] E [fuse-bridge.c:219:check_and_dump_fuse_W]
> (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d]
> (-->
> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a]
> (-->
> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb]
> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (-->
> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse:
> writing to fuse device failed: No such file or directory
> [2020-04-30 21:46:50.864797] E [fuse-bridge.c:219:check_and_dump_fuse_W]
> (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d]
> (-->
> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a]
> (-->
> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb]
> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (-->
> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse:
> writing to fuse device failed: No such file or directory
>
> The number of items being healed is going up and down wildly, from 0 to
> 8000+ and sometimes taking a really long time to return a value. I'm really
> worried as this is a production system, and I didn't observe this in our
> test system.
>
> gluster v heal apkmirror_data1 info summary
> Brick nexus2:/mnt/nexus2_block1/apkmirror_data1
> Status: Connected
> Total Number of entries: 27
> Number of entries in heal pending: 27
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick forge:/mnt/forge_block1/apkmirror_data1
> Status: Connected
> Total Number of entries: 27
> Number of entries in heal pending: 27
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick hive:/mnt/hive_block1/apkmirror_data1
> Status: Connected
> Total Number of entries: 27
> Number of entries in heal pending: 27
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick citadel:/mnt/citadel_block1/apkmirror_data1
> Status: Connected
> Total Number of entries: 8540
> Number of entries in heal pending: 8540
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> gluster v heal androidpolice_data3 info summary
> Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
> Status: Connected
> Total Number of entries: 1
> Number of entries in heal pending: 1
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick forge:/mnt/forge_block4/androidpolice_data3
> Status: Connected
> Total Number of entries: 1
> Number of entries in heal pending: 1
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick hive:/mnt/hive_block4/androidpolice_data3
> Status: Connected
> Total Number of entries: 1
> Number of entries in heal pending: 1
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick citadel:/mnt/citadel_block4/androidpolice_data3
> Status: Connected
> Total Number of entries: 1149
> Number of entries in heal pending: 1149
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> What should I do at this point? The files I tested seem to be replicating
> correctly, but I don't know if it's the case for all of them, and the heals
> going up and down, and all these log messages are making me very nervous.
>
> Thank you.
>
> Sincerely,
> Artem
>
> --
> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
> <http://www.apkmirror.com/>, Illogical Robot LLC
> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
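To watch whether the pending count on citadel is really climbing or just fluctuating, the summary output quoted above can be tallied per brick. The following is a minimal sketch, assuming the text layout shown in this message for `gluster v heal <volume> info summary`; the function names `parse_heal_summary` and `pending_by_brick` are hypothetical helpers, not part of Gluster, and the sample is a two-brick excerpt copied from the output above.

```python
from typing import Dict

# Two-brick excerpt copied from the `gluster v heal apkmirror_data1 info summary`
# output quoted in this message.
SAMPLE = """\
Brick nexus2:/mnt/nexus2_block1/apkmirror_data1
Status: Connected
Total Number of entries: 27
Number of entries in heal pending: 27
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick citadel:/mnt/citadel_block1/apkmirror_data1
Status: Connected
Total Number of entries: 8540
Number of entries in heal pending: 8540
Number of entries in split-brain: 0
Number of entries possibly healing: 0
"""

PENDING_KEY = "Number of entries in heal pending"


def parse_heal_summary(text: str) -> Dict[str, Dict[str, int]]:
    """Map each brick to its numeric counters (hypothetical helper).

    Assumes every brick section starts with a line beginning "Brick "
    and that counters look like "<label>: <integer>".
    """
    bricks: Dict[str, Dict[str, int]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            current = line[len("Brick "):]
            bricks[current] = {}
        elif current is not None and ":" in line:
            # Split on the last colon so the label keeps any earlier colons.
            key, _, value = line.rpartition(":")
            value = value.strip()
            # Non-numeric fields such as "Status: Connected" are skipped.
            if value.isdigit():
                bricks[current][key.strip()] = int(value)
    return bricks


def pending_by_brick(text: str) -> Dict[str, int]:
    """Return only the heal-pending count for each brick."""
    return {
        brick: counters.get(PENDING_KEY, 0)
        for brick, counters in parse_heal_summary(text).items()
    }


if __name__ == "__main__":
    for brick, pending in pending_by_brick(SAMPLE).items():
        print(f"{brick}: {pending} pending")
```

Running this against the command output at intervals and comparing the counts per brick would show whether the backlog on the upgraded node is growing steadily or only bouncing between samples.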
Artem Russakovskii
2020-Apr-30 22:25 UTC
[Gluster-users] Upgrade from 5.13 to 7.5 full of weird messages
If more time is needed to analyze this, is this an option? Shut down 7.5, downgrade it back to 5.13 and restart, or would this screw something up badly? I didn't up the op-version yet.

Thanks.

Sincerely,
Artem
Strahil Nikolov
2020-May-01 08:07 UTC
[Gluster-users] Upgrade from 5.13 to 7.5 full of weird messages
On May 1, 2020 1:25:17 AM GMT+03:00, Artem Russakovskii <archon810 at gmail.com> wrote:
>If more time is needed to analyze this, is this an option? Shut down 7.5,
>downgrade it back to 5.13 and restart, or would this screw something up
>badly? I didn't up the op-version yet.
>
>Thanks.
>
>Sincerely,
>Artem

It's not supported, but it usually works.

In the worst-case scenario, you can remove the node, wipe gluster on the node, reinstall the packages and add it back. That will require a full heal of the brick and, as you have previously reported, could lead to performance degradation.

I think you are on SLES, but I could be wrong. Do you have btrfs or LVM snapshots to revert from?

Best Regards,
Strahil Nikolov