Artem Russakovskii
2020-May-15 21:51 UTC
[Gluster-users] Upgrade from 5.13 to 7.5 full of weird messages
Hi, I see the team met up recently and one of the discussed items was issues upgrading to v7. What were the results of this discussion? Is the team going to respond to this thread with their thoughts and analysis? Thanks. Sincerely, Artem -- Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC beerpla.net | @ArtemR <http://twitter.com/ArtemR> On Mon, May 4, 2020 at 10:23 PM Strahil Nikolov <hunter86_bg at yahoo.com> wrote:> On May 4, 2020 4:26:32 PM GMT+03:00, Amar Tumballi <amar at kadalu.io> wrote: > >On Sat, May 2, 2020 at 10:49 PM Artem Russakovskii > ><archon810 at gmail.com> > >wrote: > > > >> I don't have geo replication. > >> > >> Still waiting for someone from the gluster team to chime in. They > >used to > >> be a lot more responsive here. Do you know if there is a holiday > >perhaps, > >> or have the working hours been cut due to Coronavirus currently? > >> > >> > >It was Holiday on May 1st, and 2nd and 3rd were Weekend days! And also > >I > >guess many of Developers from Red Hat were attending Virtual Summit! > > > > > > > >> I'm not inclined to try a v6 upgrade without their word first. > >> > > > >Fair bet! I will bring this topic in one of the community meetings, and > >ask > >developers if they have some feedback! I personally have not seen these > >errors, and don't have a hunch on which patch would have caused an > >increase > >in logs! > > > >-Amar > > > > > >> > >> On Sat, May 2, 2020, 12:47 AM Strahil Nikolov <hunter86_bg at yahoo.com> > >> wrote: > >> > >>> On May 1, 2020 8:03:50 PM GMT+03:00, Artem Russakovskii < > >>> archon810 at gmail.com> wrote: > >>> >The good news is the downgrade seems to have worked and was > >painless. > >>> > > >>> >zypper install --oldpackage glusterfs-5.13, restart gluster, and > >almost > >>> >immediately there are no heal pending entries anymore. > >>> > > >>> >The only things still showing up in the logs, besides some healing > >is > >>> >0-glusterfs-fuse: > >>> >writing to fuse device failed: No such file or directory: > >>> >==> mnt-androidpolice_data3.log <=> >>> >[2020-05-01 16:54:21.085643] E > >>> >[fuse-bridge.c:219:check_and_dump_fuse_W] > >>> >(--> > >>> > >>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] > >>> >(--> > >>> > >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] > >>> >(--> > >>> > >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] > >>> >(--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> > >>> >/lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) > >0-glusterfs-fuse: > >>> >writing to fuse device failed: No such file or directory > >>> >==> mnt-apkmirror_data1.log <=> >>> >[2020-05-01 16:54:21.268842] E > >>> >[fuse-bridge.c:219:check_and_dump_fuse_W] > >>> >(--> > >>> > >>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fdf2b0a624d] > >>> >(--> > >>> > >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fdf2748949a] > >>> >(--> > >>> > >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fdf274897bb] > >>> >(--> /lib64/libpthread.so.0(+0x84f9)[0x7fdf2a5f64f9] (--> > >>> >/lib64/libc.so.6(clone+0x3f)[0x7fdf2a32ef2f] ))))) > >0-glusterfs-fuse: > >>> >writing to fuse device failed: No such file or directory > >>> > > >>> >It'd be very helpful if it had more info about what failed to write > >and > >>> >why. 
> >>> > > >>> >I'd still really love to see the analysis of this failed upgrade > >from > >>> >core > >>> >gluster maintainers to see what needs fixing and how we can upgrade > >in > >>> >the > >>> >future. > >>> > > >>> >Thanks. > >>> > > >>> >Sincerely, > >>> >Artem > >>> > > >>> >-- > >>> >Founder, Android Police <http://www.androidpolice.com>, APK Mirror > >>> ><http://www.apkmirror.com/>, Illogical Robot LLC > >>> >beerpla.net | @ArtemR <http://twitter.com/ArtemR> > >>> > > >>> > > >>> >On Fri, May 1, 2020 at 7:25 AM Artem Russakovskii > ><archon810 at gmail.com> > >>> >wrote: > >>> > > >>> >> I do not have snapshots, no. I have a general file based backup, > >but > >>> >also > >>> >> the other 3 nodes are up. > >>> >> > >>> >> OpenSUSE 15.1. > >>> >> > >>> >> If I try to downgrade and it doesn't work, what's the brick > >>> >replacement > >>> >> scenario - is this still accurate? > >>> >> > >>> > > >>> > > > https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-brick > >>> >> > >>> >> Any feedback about the issues themselves yet please? > >Specifically, is > >>> >> there a chance this is happening because of the mismatched > >gluster > >>> >> versions? Though, what's the solution then? > >>> >> > >>> >> On Fri, May 1, 2020, 1:07 AM Strahil Nikolov > ><hunter86_bg at yahoo.com> > >>> >> wrote: > >>> >> > >>> >>> On May 1, 2020 1:25:17 AM GMT+03:00, Artem Russakovskii < > >>> >>> archon810 at gmail.com> wrote: > >>> >>> >If more time is needed to analyze this, is this an option? Shut > >>> >down > >>> >>> >7.5, > >>> >>> >downgrade it back to 5.13 and restart, or would this screw > >>> >something up > >>> >>> >badly? I didn't up the op-version yet. > >>> >>> > > >>> >>> >Thanks. > >>> >>> > > >>> >>> >Sincerely, > >>> >>> >Artem > >>> >>> > > >>> >>> >-- > >>> >>> >Founder, Android Police <http://www.androidpolice.com>, APK > >Mirror > >>> >>> ><http://www.apkmirror.com/>, Illogical Robot LLC > >>> >>> >beerpla.net | @ArtemR <http://twitter.com/ArtemR> > >>> >>> > > >>> >>> > > >>> >>> >On Thu, Apr 30, 2020 at 3:13 PM Artem Russakovskii > >>> >>> ><archon810 at gmail.com> > >>> >>> >wrote: > >>> >>> > > >>> >>> >> The number of heal pending on citadel, the one that was > >upgraded > >>> >to > >>> >>> >7.5, > >>> >>> >> has now gone to 10s of thousands and continues to go up. > >>> >>> >> > >>> >>> >> Sincerely, > >>> >>> >> Artem > >>> >>> >> > >>> >>> >> -- > >>> >>> >> Founder, Android Police <http://www.androidpolice.com>, APK > >>> >Mirror > >>> >>> >> <http://www.apkmirror.com/>, Illogical Robot LLC > >>> >>> >> beerpla.net | @ArtemR <http://twitter.com/ArtemR> > >>> >>> >> > >>> >>> >> > >>> >>> >> On Thu, Apr 30, 2020 at 2:57 PM Artem Russakovskii > >>> >>> ><archon810 at gmail.com> > >>> >>> >> wrote: > >>> >>> >> > >>> >>> >>> Hi all, > >>> >>> >>> > >>> >>> >>> Today, I decided to upgrade one of the four servers > >(citadel) we > >>> >>> >have to > >>> >>> >>> 7.5 from 5.13. There are 2 volumes, 1x4 replicate, and fuse > >>> >mounts > >>> >>> >(I sent > >>> >>> >>> the full details earlier in another message). If everything > >>> >looked > >>> >>> >OK, I > >>> >>> >>> would have proceeded the rolling upgrade for all of them, > >>> >following > >>> >>> >the > >>> >>> >>> full heal. 
> >>> >>> >>> > >>> >>> >>> However, as soon as I upgraded and restarted, the logs > >filled > >>> >with > >>> >>> >>> messages like these: > >>> >>> >>> > >>> >>> >>> [2020-04-30 21:39:21.316149] E > >>> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc > >actor > >>> >>> >>> (1298437:400:17) failed to complete successfully > >>> >>> >>> [2020-04-30 21:39:21.382891] E > >>> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc > >actor > >>> >>> >>> (1298437:400:17) failed to complete successfully > >>> >>> >>> [2020-04-30 21:39:21.442440] E > >>> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc > >actor > >>> >>> >>> (1298437:400:17) failed to complete successfully > >>> >>> >>> [2020-04-30 21:39:21.445587] E > >>> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc > >actor > >>> >>> >>> (1298437:400:17) failed to complete successfully > >>> >>> >>> [2020-04-30 21:39:21.571398] E > >>> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc > >actor > >>> >>> >>> (1298437:400:17) failed to complete successfully > >>> >>> >>> [2020-04-30 21:39:21.668192] E > >>> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc > >actor > >>> >>> >>> (1298437:400:17) failed to complete successfully > >>> >>> >>> > >>> >>> >>> > >>> >>> >>> The message "I [MSGID: 108031] > >>> >>> >>> [afr-common.c:2581:afr_local_discovery_cbk] > >>> >>> >>> 0-androidpolice_data3-replicate-0: selecting local > >read_child > >>> >>> >>> androidpolice_data3-client-3" repeated 10 times between > >>> >[2020-04-30 > >>> >>> >>> 21:46:41.854675] and [2020-04-30 21:48:20.206323] > >>> >>> >>> The message "W [MSGID: 114031] > >>> >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] > >>> >>> >>> 0-androidpolice_data3-client-1: remote operation failed > >>> >[Transport > >>> >>> >endpoint > >>> >>> >>> is not connected]" repeated 264 times between [2020-04-30 > >>> >>> >21:46:32.129567] > >>> >>> >>> and [2020-04-30 21:48:29.905008] > >>> >>> >>> The message "W [MSGID: 114031] > >>> >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] > >>> >>> >>> 0-androidpolice_data3-client-0: remote operation failed > >>> >[Transport > >>> >>> >endpoint > >>> >>> >>> is not connected]" repeated 264 times between [2020-04-30 > >>> >>> >21:46:32.129602] > >>> >>> >>> and [2020-04-30 21:48:29.905040] > >>> >>> >>> The message "W [MSGID: 114031] > >>> >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] > >>> >>> >>> 0-androidpolice_data3-client-2: remote operation failed > >>> >[Transport > >>> >>> >endpoint > >>> >>> >>> is not connected]" repeated 264 times between [2020-04-30 > >>> >>> >21:46:32.129512] > >>> >>> >>> and [2020-04-30 21:48:29.905047] > >>> >>> >>> > >>> >>> >>> > >>> >>> >>> > >>> >>> >>> Once in a while, I'm seeing this: > >>> >>> >>> ==> bricks/mnt-hive_block4-androidpolice_data3.log <=> >>> >>> >>> [2020-04-30 21:45:54.251637] I [MSGID: 115072] > >>> >>> >>> [server-rpc-fops_v2.c:1681:server4_setattr_cbk] > >>> >>> >>> 0-androidpolice_data3-server: 5725811: SETATTR / > >>> >>> >>> > >>> >>> > > >>> >>> > >>> > > >>> > > > androidpolice.com/public/wp-content/uploads/2019/03/cielo-breez-plus-hero.png > >>> >>> >>> (d4556eb4-f15b-412c-a42a-32b4438af557), client: > >>> >>> >>> > >>> >>> > >>> >>> > >>> > >>> > > >>>CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-2-RECON_NO:-1, > >>> >>> >>> error-xlator: androidpolice_data3-access-control [Operation > >not > >>> >>> 
>permitted] > >>> >>> >>> [2020-04-30 21:49:10.439701] I [MSGID: 115072] > >>> >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] > >>> >>> >>> 0-androidpolice_data3-server: 201833: SETATTR / > >>> >>> >>> androidpolice.com/public/wp-content/uploads > >>> >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: > >>> >>> >>> > >>> >>> > >>> >>> > >>> > >>> > > >>>CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, > >>> >>> >>> error-xlator: androidpolice_data3-access-control [Operation > >not > >>> >>> >permitted] > >>> >>> >>> [2020-04-30 21:49:10.453724] I [MSGID: 115072] > >>> >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] > >>> >>> >>> 0-androidpolice_data3-server: 201842: SETATTR / > >>> >>> >>> androidpolice.com/public/wp-content/uploads > >>> >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: > >>> >>> >>> > >>> >>> > >>> >>> > >>> > >>> > > >>>CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, > >>> >>> >>> error-xlator: androidpolice_data3-access-control [Operation > >not > >>> >>> >permitted] > >>> >>> >>> [2020-04-30 21:49:16.224662] I [MSGID: 115072] > >>> >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] > >>> >>> >>> 0-androidpolice_data3-server: 202865: SETATTR / > >>> >>> >>> androidpolice.com/public/wp-content/uploads > >>> >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: > >>> >>> >>> > >>> >>> > >>> >>> > >>> > >>> > > >>>CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, > >>> >>> >>> error-xlator: androidpolice_data3-access-control [Operation > >not > >>> >>> >permitted] > >>> >>> >>> > >>> >>> >>> There's also lots of self-healing happening that I didn't > >expect > >>> >at > >>> >>> >all, > >>> >>> >>> since the upgrade only took ~10-15s. > >>> >>> >>> [2020-04-30 21:47:38.714448] I [MSGID: 108026] > >>> >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] > >>> >>> >>> 0-apkmirror_data1-replicate-0: performing metadata selfheal > >on > >>> >>> >>> 4a6ba2d7-7ad8-4113-862b-02e4934a3461 > >>> >>> >>> [2020-04-30 21:47:38.765033] I [MSGID: 108026] > >>> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] > >>> >>> >>> 0-apkmirror_data1-replicate-0: Completed metadata selfheal > >on > >>> >>> >>> 4a6ba2d7-7ad8-4113-862b-02e4934a3461. sources=[3] sinks=0 1 > >2 > >>> >>> >>> [2020-04-30 21:47:38.765289] I [MSGID: 108026] > >>> >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] > >>> >>> >>> 0-apkmirror_data1-replicate-0: performing metadata selfheal > >on > >>> >>> >>> f3c62a41-1864-4e75-9883-4357a7091296 > >>> >>> >>> [2020-04-30 21:47:38.800987] I [MSGID: 108026] > >>> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] > >>> >>> >>> 0-apkmirror_data1-replicate-0: Completed metadata selfheal > >on > >>> >>> >>> f3c62a41-1864-4e75-9883-4357a7091296. sources=[3] sinks=0 1 > >2 > >>> >>> >>> > >>> >>> >>> > >>> >>> >>> I'm also seeing "remote operation failed" and "writing to > >fuse > >>> >>> >device > >>> >>> >>> failed: No such file or directory" messages > >>> >>> >>> [2020-04-30 21:46:34.891957] I [MSGID: 108026] > >>> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] > >>> >>> >>> 0-androidpolice_data3-replicate-0: Completed metadata > >selfheal > >>> >on > >>> >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591. 
sources=0 1 [2] > >sinks=3 > >>> >>> >>> [2020-04-30 21:45:36.127412] W [MSGID: 114031] > >>> >>> >>> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] > >>> >>> >>> 0-androidpolice_data3-client-0: remote operation failed > >>> >[Operation > >>> >>> >not > >>> >>> >>> permitted] > >>> >>> >>> [2020-04-30 21:45:36.345924] W [MSGID: 114031] > >>> >>> >>> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] > >>> >>> >>> 0-androidpolice_data3-client-1: remote operation failed > >>> >[Operation > >>> >>> >not > >>> >>> >>> permitted] > >>> >>> >>> [2020-04-30 21:46:35.291853] I [MSGID: 108031] > >>> >>> >>> [afr-common.c:2543:afr_local_discovery_cbk] > >>> >>> >>> 0-androidpolice_data3-replicate-0: selecting local > >read_child > >>> >>> >>> androidpolice_data3-client-2 > >>> >>> >>> [2020-04-30 21:46:35.977342] I [MSGID: 108026] > >>> >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] > >>> >>> >>> 0-androidpolice_data3-replicate-0: performing metadata > >selfheal > >>> >on > >>> >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591 > >>> >>> >>> [2020-04-30 21:46:36.006607] I [MSGID: 108026] > >>> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] > >>> >>> >>> 0-androidpolice_data3-replicate-0: Completed metadata > >selfheal > >>> >on > >>> >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] > >sinks=3 > >>> >>> >>> [2020-04-30 21:46:37.245599] E > >>> >>> >[fuse-bridge.c:219:check_and_dump_fuse_W] > >>> >>> >>> (--> > >>> >>> > >>> > >>>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] > >>> >>> >>> (--> > >>> >>> >>> > >>> >>> > >>> > >>>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] > >>> >>> >>> (--> > >>> >>> >>> > >>> >>> > >>> > >>>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] > >>> >>> >>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> > >>> >>> >>> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) > >>> >0-glusterfs-fuse: > >>> >>> >>> writing to fuse device failed: No such file or directory > >>> >>> >>> [2020-04-30 21:46:50.864797] E > >>> >>> >[fuse-bridge.c:219:check_and_dump_fuse_W] > >>> >>> >>> (--> > >>> >>> > >>> > >>>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] > >>> >>> >>> (--> > >>> >>> >>> > >>> >>> > >>> > >>>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] > >>> >>> >>> (--> > >>> >>> >>> > >>> >>> > >>> > >>>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] > >>> >>> >>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> > >>> >>> >>> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) > >>> >0-glusterfs-fuse: > >>> >>> >>> writing to fuse device failed: No such file or directory > >>> >>> >>> > >>> >>> >>> The number of items being healed is going up and down > >wildly, > >>> >from 0 > >>> >>> >to > >>> >>> >>> 8000+ and sometimes taking a really long time to return a > >value. > >>> >I'm > >>> >>> >really > >>> >>> >>> worried as this is a production system, and I didn't observe > >>> >this in > >>> >>> >our > >>> >>> >>> test system. 
> >>> >>> >>> > >>> >>> >>> > >>> >>> >>> > >>> >>> >>> gluster v heal apkmirror_data1 info summary > >>> >>> >>> Brick nexus2:/mnt/nexus2_block1/apkmirror_data1 > >>> >>> >>> Status: Connected > >>> >>> >>> Total Number of entries: 27 > >>> >>> >>> Number of entries in heal pending: 27 > >>> >>> >>> Number of entries in split-brain: 0 > >>> >>> >>> Number of entries possibly healing: 0 > >>> >>> >>> > >>> >>> >>> Brick forge:/mnt/forge_block1/apkmirror_data1 > >>> >>> >>> Status: Connected > >>> >>> >>> Total Number of entries: 27 > >>> >>> >>> Number of entries in heal pending: 27 > >>> >>> >>> Number of entries in split-brain: 0 > >>> >>> >>> Number of entries possibly healing: 0 > >>> >>> >>> > >>> >>> >>> Brick hive:/mnt/hive_block1/apkmirror_data1 > >>> >>> >>> Status: Connected > >>> >>> >>> Total Number of entries: 27 > >>> >>> >>> Number of entries in heal pending: 27 > >>> >>> >>> Number of entries in split-brain: 0 > >>> >>> >>> Number of entries possibly healing: 0 > >>> >>> >>> > >>> >>> >>> Brick citadel:/mnt/citadel_block1/apkmirror_data1 > >>> >>> >>> Status: Connected > >>> >>> >>> Total Number of entries: 8540 > >>> >>> >>> Number of entries in heal pending: 8540 > >>> >>> >>> Number of entries in split-brain: 0 > >>> >>> >>> Number of entries possibly healing: 0 > >>> >>> >>> > >>> >>> >>> > >>> >>> >>> > >>> >>> >>> gluster v heal androidpolice_data3 info summary > >>> >>> >>> Brick nexus2:/mnt/nexus2_block4/androidpolice_data3 > >>> >>> >>> Status: Connected > >>> >>> >>> Total Number of entries: 1 > >>> >>> >>> Number of entries in heal pending: 1 > >>> >>> >>> Number of entries in split-brain: 0 > >>> >>> >>> Number of entries possibly healing: 0 > >>> >>> >>> > >>> >>> >>> Brick forge:/mnt/forge_block4/androidpolice_data3 > >>> >>> >>> Status: Connected > >>> >>> >>> Total Number of entries: 1 > >>> >>> >>> Number of entries in heal pending: 1 > >>> >>> >>> Number of entries in split-brain: 0 > >>> >>> >>> Number of entries possibly healing: 0 > >>> >>> >>> > >>> >>> >>> Brick hive:/mnt/hive_block4/androidpolice_data3 > >>> >>> >>> Status: Connected > >>> >>> >>> Total Number of entries: 1 > >>> >>> >>> Number of entries in heal pending: 1 > >>> >>> >>> Number of entries in split-brain: 0 > >>> >>> >>> Number of entries possibly healing: 0 > >>> >>> >>> > >>> >>> >>> Brick citadel:/mnt/citadel_block4/androidpolice_data3 > >>> >>> >>> Status: Connected > >>> >>> >>> Total Number of entries: 1149 > >>> >>> >>> Number of entries in heal pending: 1149 > >>> >>> >>> Number of entries in split-brain: 0 > >>> >>> >>> Number of entries possibly healing: 0 > >>> >>> >>> > >>> >>> >>> > >>> >>> >>> What should I do at this point? The files I tested seem to > >be > >>> >>> >replicating > >>> >>> >>> correctly, but I don't know if it's the case for all of > >them, > >>> >and > >>> >>> >the heals > >>> >>> >>> going up and down, and all these log messages are making me > >very > >>> >>> >nervous. > >>> >>> >>> > >>> >>> >>> Thank you. > >>> >>> >>> > >>> >>> >>> Sincerely, > >>> >>> >>> Artem > >>> >>> >>> > >>> >>> >>> -- > >>> >>> >>> Founder, Android Police <http://www.androidpolice.com>, APK > >>> >Mirror > >>> >>> >>> <http://www.apkmirror.com/>, Illogical Robot LLC > >>> >>> >>> beerpla.net | @ArtemR <http://twitter.com/ArtemR> > >>> >>> >>> > >>> >>> >> > >>> >>> > >>> >>> I's not supported , but usually it works. 
> >>> >>> > >>> >>> In worst case scenario, you can remove the node, wipe gluster > >on > >>> >the > >>> >>> node, reinstall the packages and add it - it will require full > >heal > >>> >of the > >>> >>> brick and as you have previously reported could lead to > >performance > >>> >>> degradation. > >>> >>> > >>> >>> I think you are on SLES, but I could be wrong . Do you have > >btrfs or > >>> >LVM > >>> >>> snapshots to revert from ? > >>> >>> > >>> >>> Best Regards, > >>> >>> Strahil Nikolov > >>> >>> > >>> >> > >>> > >>> Hi Artem, > >>> > >>> You can increase the brick log level following > >>> > > > https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level > >>> but keep in mind that logs grow quite fast - so don't keep them > >above the > >>> current level for too much time. > >>> > >>> > >>> Do you have a geo replication running ? > >>> > >>> About the migration issue - I have no clue why this happened. Last > >time I > >>> skipped a major release(3.12 to 5.5) I got a huge trouble (all > >files > >>> ownership was switched to root) and I have the feeling that it > >won't > >>> happen again if you go through v6. > >>> > >>> Best Regards, > >>> Strahil Nikolov > >>> > >> ________ > >> > >> > >> > >> Community Meeting Calendar: > >> > >> Schedule - > >> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC > >> Bridge: https://bluejeans.com/441850968 > >> > >> Gluster-users mailing list > >> Gluster-users at gluster.org > >> https://lists.gluster.org/mailman/listinfo/gluster-users > >> > > Hey Artem, > > I just checked if the 'replica 4' is causing the issue , but that's not > true (tested with 1 node down, but it's the same situation). > > I created 4 VMs on CentOS 7 & Gluster v7.5 (brick has only noatime mount > option) and created a 'replica 4' volume. > Then I created a dir and placed 50000 very small files there via: > for i in {1..50000}; do echo $RANDOM > $i ; done > > The find command 'finds' them in 4s and after some tuning I have managed > to lower it to 2.5s. > > What has caused some improvement was: > A) Activated the rhgs-random-io tuned profile which you can take from > ftp://ftp.redhat.com/redhat/linux/enterprise/7Server/en/RHS/SRPMS/redhat-storage-server-3.5.0.0-1.el7rhgs.src.rpm > B) using noatime for the mount option and if you use SELINUX you could > use the 'context=system_u:object_r:glusterd_brick_t:s0' mount option to > prevent selinux context lookups > C) Activation of the gluster group of settings 'metadata-cache' or > 'nl-cache' brought 'find' to the same results - lowered from 3.5s to 2.5s > after an initial run. > > I know that I'm not comparing apples to apples , but still it might help. > > I would like to learn what actually gluster does when a 'find' or 'ls' is > invoked, as I doubt it just executes it on the bricks. > > Best Regards, > Strahil Nikolov >
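Turning the suggestions quoted above into commands, a minimal sketch could look like the following. It assumes a volume named androidpolice_data3 (taken from this thread); the device, filesystem type and fstab layout are placeholders, and none of this is a confirmed fix for the 5.13-to-7.5 problems discussed here.

# Item A: activate the random-io tuned profile (requires the redhat-storage-server tuned profiles to be installed).
tuned-adm profile rhgs-random-io

# Item B: example brick mount with noatime and a fixed SELinux context
# (device, filesystem and mount point are illustrative):
# /dev/vgdata/brick1  /mnt/hive_block4  xfs  noatime,context="system_u:object_r:glusterd_brick_t:s0"  0 0

# Item C: apply the predefined gluster option groups.
gluster volume set androidpolice_data3 group metadata-cache
gluster volume set androidpolice_data3 group nl-cache

# Brick log level for debugging, per the Red Hat guide linked above (default is INFO);
# raise it temporarily, then revert.
gluster volume set androidpolice_data3 diagnostics.brick-log-level DEBUG
gluster volume set androidpolice_data3 diagnostics.brick-log-level INFO

As noted above, a raised log level should be reverted promptly, since the logs grow quickly.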
Artem Russakovskii
2020-May-21 19:43 UTC
[Gluster-users] Upgrade from 5.13 to 7.5 full of weird messages
I've also moved this to github: https://github.com/gluster/glusterfs/issues/1257. Sincerely, Artem -- Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC beerpla.net | @ArtemR <http://twitter.com/ArtemR>
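For the rolling-upgrade question that runs through this thread (and the GitHub issue above), a minimal per-node pre-flight check might look like the sketch below. The volume names come from the thread; the commands are standard gluster CLI, and the whole thing is an illustration rather than an official upgrade procedure.

for vol in apkmirror_data1 androidpolice_data3; do
    # Per-brick heal state; move on to the next node only when everything is at zero.
    gluster volume heal "$vol" info summary
    gluster volume heal "$vol" statistics heal-count
done

# All peers should show 'Peer in Cluster (Connected)'.
gluster peer status

# Current and maximum supported cluster op-version (left unchanged during the attempt described above).
gluster volume get all cluster.op-version
gluster volume get all cluster.max-op-version

The upgrade guides recommend bumping the op-version only after every node has been upgraded and the heal counts have settled, which matches the "I didn't up the op-version yet" note earlier in the thread.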