Artem Russakovskii
2020-May-02 17:18 UTC
[Gluster-users] Upgrade from 5.13 to 7.5 full of weird messages
I don't have geo replication. Still waiting for someone from the gluster team to chime in. They used to be a lot more responsive here. Do you know if there is a holiday perhaps, or have the working hours been cut due to Coronavirus currently? I'm not inclined to try a v6 upgrade without their word first. On Sat, May 2, 2020, 12:47 AM Strahil Nikolov <hunter86_bg at yahoo.com> wrote:> On May 1, 2020 8:03:50 PM GMT+03:00, Artem Russakovskii < > archon810 at gmail.com> wrote: > >The good news is the downgrade seems to have worked and was painless. > > > >zypper install --oldpackage glusterfs-5.13, restart gluster, and almost > >immediately there are no heal pending entries anymore. > > > >The only things still showing up in the logs, besides some healing is > >0-glusterfs-fuse: > >writing to fuse device failed: No such file or directory: > >==> mnt-androidpolice_data3.log <=> >[2020-05-01 16:54:21.085643] E > >[fuse-bridge.c:219:check_and_dump_fuse_W] > >(--> > >/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] > >(--> > >/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] > >(--> > >/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] > >(--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> > >/lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse: > >writing to fuse device failed: No such file or directory > >==> mnt-apkmirror_data1.log <=> >[2020-05-01 16:54:21.268842] E > >[fuse-bridge.c:219:check_and_dump_fuse_W] > >(--> > >/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fdf2b0a624d] > >(--> > >/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fdf2748949a] > >(--> > >/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fdf274897bb] > >(--> /lib64/libpthread.so.0(+0x84f9)[0x7fdf2a5f64f9] (--> > >/lib64/libc.so.6(clone+0x3f)[0x7fdf2a32ef2f] ))))) 0-glusterfs-fuse: > >writing to fuse device failed: No such file or directory > > > >It'd be very helpful if it had more info about what failed to write and > >why. > > > >I'd still really love to see the analysis of this failed upgrade from > >core > >gluster maintainers to see what needs fixing and how we can upgrade in > >the > >future. > > > >Thanks. > > > >Sincerely, > >Artem > > > >-- > >Founder, Android Police <http://www.androidpolice.com>, APK Mirror > ><http://www.apkmirror.com/>, Illogical Robot LLC > >beerpla.net | @ArtemR <http://twitter.com/ArtemR> > > > > > >On Fri, May 1, 2020 at 7:25 AM Artem Russakovskii <archon810 at gmail.com> > >wrote: > > > >> I do not have snapshots, no. I have a general file based backup, but > >also > >> the other 3 nodes are up. > >> > >> OpenSUSE 15.1. > >> > >> If I try to downgrade and it doesn't work, what's the brick > >replacement > >> scenario - is this still accurate? > >> > > > https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-brick > >> > >> Any feedback about the issues themselves yet please? Specifically, is > >> there a chance this is happening because of the mismatched gluster > >> versions? Though, what's the solution then? > >> > >> On Fri, May 1, 2020, 1:07 AM Strahil Nikolov <hunter86_bg at yahoo.com> > >> wrote: > >> > >>> On May 1, 2020 1:25:17 AM GMT+03:00, Artem Russakovskii < > >>> archon810 at gmail.com> wrote: > >>> >If more time is needed to analyze this, is this an option? Shut > >down > >>> >7.5, > >>> >downgrade it back to 5.13 and restart, or would this screw > >something up > >>> >badly? I didn't up the op-version yet. > >>> > > >>> >Thanks. > >>> > > >>> >Sincerely, > >>> >Artem > >>> > > >>> >-- > >>> >Founder, Android Police <http://www.androidpolice.com>, APK Mirror > >>> ><http://www.apkmirror.com/>, Illogical Robot LLC > >>> >beerpla.net | @ArtemR <http://twitter.com/ArtemR> > >>> > > >>> > > >>> >On Thu, Apr 30, 2020 at 3:13 PM Artem Russakovskii > >>> ><archon810 at gmail.com> > >>> >wrote: > >>> > > >>> >> The number of heal pending on citadel, the one that was upgraded > >to > >>> >7.5, > >>> >> has now gone to 10s of thousands and continues to go up. > >>> >> > >>> >> Sincerely, > >>> >> Artem > >>> >> > >>> >> -- > >>> >> Founder, Android Police <http://www.androidpolice.com>, APK > >Mirror > >>> >> <http://www.apkmirror.com/>, Illogical Robot LLC > >>> >> beerpla.net | @ArtemR <http://twitter.com/ArtemR> > >>> >> > >>> >> > >>> >> On Thu, Apr 30, 2020 at 2:57 PM Artem Russakovskii > >>> ><archon810 at gmail.com> > >>> >> wrote: > >>> >> > >>> >>> Hi all, > >>> >>> > >>> >>> Today, I decided to upgrade one of the four servers (citadel) we > >>> >have to > >>> >>> 7.5 from 5.13. There are 2 volumes, 1x4 replicate, and fuse > >mounts > >>> >(I sent > >>> >>> the full details earlier in another message). If everything > >looked > >>> >OK, I > >>> >>> would have proceeded the rolling upgrade for all of them, > >following > >>> >the > >>> >>> full heal. > >>> >>> > >>> >>> However, as soon as I upgraded and restarted, the logs filled > >with > >>> >>> messages like these: > >>> >>> > >>> >>> [2020-04-30 21:39:21.316149] E > >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor > >>> >>> (1298437:400:17) failed to complete successfully > >>> >>> [2020-04-30 21:39:21.382891] E > >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor > >>> >>> (1298437:400:17) failed to complete successfully > >>> >>> [2020-04-30 21:39:21.442440] E > >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor > >>> >>> (1298437:400:17) failed to complete successfully > >>> >>> [2020-04-30 21:39:21.445587] E > >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor > >>> >>> (1298437:400:17) failed to complete successfully > >>> >>> [2020-04-30 21:39:21.571398] E > >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor > >>> >>> (1298437:400:17) failed to complete successfully > >>> >>> [2020-04-30 21:39:21.668192] E > >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor > >>> >>> (1298437:400:17) failed to complete successfully > >>> >>> > >>> >>> > >>> >>> The message "I [MSGID: 108031] > >>> >>> [afr-common.c:2581:afr_local_discovery_cbk] > >>> >>> 0-androidpolice_data3-replicate-0: selecting local read_child > >>> >>> androidpolice_data3-client-3" repeated 10 times between > >[2020-04-30 > >>> >>> 21:46:41.854675] and [2020-04-30 21:48:20.206323] > >>> >>> The message "W [MSGID: 114031] > >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] > >>> >>> 0-androidpolice_data3-client-1: remote operation failed > >[Transport > >>> >endpoint > >>> >>> is not connected]" repeated 264 times between [2020-04-30 > >>> >21:46:32.129567] > >>> >>> and [2020-04-30 21:48:29.905008] > >>> >>> The message "W [MSGID: 114031] > >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] > >>> >>> 0-androidpolice_data3-client-0: remote operation failed > >[Transport > >>> >endpoint > >>> >>> is not connected]" repeated 264 times between [2020-04-30 > >>> >21:46:32.129602] > >>> >>> and [2020-04-30 21:48:29.905040] > >>> >>> The message "W [MSGID: 114031] > >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] > >>> >>> 0-androidpolice_data3-client-2: remote operation failed > >[Transport > >>> >endpoint > >>> >>> is not connected]" repeated 264 times between [2020-04-30 > >>> >21:46:32.129512] > >>> >>> and [2020-04-30 21:48:29.905047] > >>> >>> > >>> >>> > >>> >>> > >>> >>> Once in a while, I'm seeing this: > >>> >>> ==> bricks/mnt-hive_block4-androidpolice_data3.log <=> >>> >>> [2020-04-30 21:45:54.251637] I [MSGID: 115072] > >>> >>> [server-rpc-fops_v2.c:1681:server4_setattr_cbk] > >>> >>> 0-androidpolice_data3-server: 5725811: SETATTR / > >>> >>> > >>> > > >>> > > > androidpolice.com/public/wp-content/uploads/2019/03/cielo-breez-plus-hero.png > >>> >>> (d4556eb4-f15b-412c-a42a-32b4438af557), client: > >>> >>> > >>> > >>> > > >>CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-2-RECON_NO:-1, > >>> >>> error-xlator: androidpolice_data3-access-control [Operation not > >>> >permitted] > >>> >>> [2020-04-30 21:49:10.439701] I [MSGID: 115072] > >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] > >>> >>> 0-androidpolice_data3-server: 201833: SETATTR / > >>> >>> androidpolice.com/public/wp-content/uploads > >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: > >>> >>> > >>> > >>> > > >>CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, > >>> >>> error-xlator: androidpolice_data3-access-control [Operation not > >>> >permitted] > >>> >>> [2020-04-30 21:49:10.453724] I [MSGID: 115072] > >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] > >>> >>> 0-androidpolice_data3-server: 201842: SETATTR / > >>> >>> androidpolice.com/public/wp-content/uploads > >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: > >>> >>> > >>> > >>> > > >>CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, > >>> >>> error-xlator: androidpolice_data3-access-control [Operation not > >>> >permitted] > >>> >>> [2020-04-30 21:49:16.224662] I [MSGID: 115072] > >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] > >>> >>> 0-androidpolice_data3-server: 202865: SETATTR / > >>> >>> androidpolice.com/public/wp-content/uploads > >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: > >>> >>> > >>> > >>> > > >>CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, > >>> >>> error-xlator: androidpolice_data3-access-control [Operation not > >>> >permitted] > >>> >>> > >>> >>> There's also lots of self-healing happening that I didn't expect > >at > >>> >all, > >>> >>> since the upgrade only took ~10-15s. > >>> >>> [2020-04-30 21:47:38.714448] I [MSGID: 108026] > >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] > >>> >>> 0-apkmirror_data1-replicate-0: performing metadata selfheal on > >>> >>> 4a6ba2d7-7ad8-4113-862b-02e4934a3461 > >>> >>> [2020-04-30 21:47:38.765033] I [MSGID: 108026] > >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] > >>> >>> 0-apkmirror_data1-replicate-0: Completed metadata selfheal on > >>> >>> 4a6ba2d7-7ad8-4113-862b-02e4934a3461. sources=[3] sinks=0 1 2 > >>> >>> [2020-04-30 21:47:38.765289] I [MSGID: 108026] > >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] > >>> >>> 0-apkmirror_data1-replicate-0: performing metadata selfheal on > >>> >>> f3c62a41-1864-4e75-9883-4357a7091296 > >>> >>> [2020-04-30 21:47:38.800987] I [MSGID: 108026] > >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] > >>> >>> 0-apkmirror_data1-replicate-0: Completed metadata selfheal on > >>> >>> f3c62a41-1864-4e75-9883-4357a7091296. sources=[3] sinks=0 1 2 > >>> >>> > >>> >>> > >>> >>> I'm also seeing "remote operation failed" and "writing to fuse > >>> >device > >>> >>> failed: No such file or directory" messages > >>> >>> [2020-04-30 21:46:34.891957] I [MSGID: 108026] > >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] > >>> >>> 0-androidpolice_data3-replicate-0: Completed metadata selfheal > >on > >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3 > >>> >>> [2020-04-30 21:45:36.127412] W [MSGID: 114031] > >>> >>> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] > >>> >>> 0-androidpolice_data3-client-0: remote operation failed > >[Operation > >>> >not > >>> >>> permitted] > >>> >>> [2020-04-30 21:45:36.345924] W [MSGID: 114031] > >>> >>> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] > >>> >>> 0-androidpolice_data3-client-1: remote operation failed > >[Operation > >>> >not > >>> >>> permitted] > >>> >>> [2020-04-30 21:46:35.291853] I [MSGID: 108031] > >>> >>> [afr-common.c:2543:afr_local_discovery_cbk] > >>> >>> 0-androidpolice_data3-replicate-0: selecting local read_child > >>> >>> androidpolice_data3-client-2 > >>> >>> [2020-04-30 21:46:35.977342] I [MSGID: 108026] > >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] > >>> >>> 0-androidpolice_data3-replicate-0: performing metadata selfheal > >on > >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591 > >>> >>> [2020-04-30 21:46:36.006607] I [MSGID: 108026] > >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] > >>> >>> 0-androidpolice_data3-replicate-0: Completed metadata selfheal > >on > >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3 > >>> >>> [2020-04-30 21:46:37.245599] E > >>> >[fuse-bridge.c:219:check_and_dump_fuse_W] > >>> >>> (--> > >>> > >>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] > >>> >>> (--> > >>> >>> > >>> > >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] > >>> >>> (--> > >>> >>> > >>> > >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] > >>> >>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> > >>> >>> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) > >0-glusterfs-fuse: > >>> >>> writing to fuse device failed: No such file or directory > >>> >>> [2020-04-30 21:46:50.864797] E > >>> >[fuse-bridge.c:219:check_and_dump_fuse_W] > >>> >>> (--> > >>> > >>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] > >>> >>> (--> > >>> >>> > >>> > >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] > >>> >>> (--> > >>> >>> > >>> > >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] > >>> >>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> > >>> >>> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) > >0-glusterfs-fuse: > >>> >>> writing to fuse device failed: No such file or directory > >>> >>> > >>> >>> The number of items being healed is going up and down wildly, > >from 0 > >>> >to > >>> >>> 8000+ and sometimes taking a really long time to return a value. > >I'm > >>> >really > >>> >>> worried as this is a production system, and I didn't observe > >this in > >>> >our > >>> >>> test system. > >>> >>> > >>> >>> > >>> >>> > >>> >>> gluster v heal apkmirror_data1 info summary > >>> >>> Brick nexus2:/mnt/nexus2_block1/apkmirror_data1 > >>> >>> Status: Connected > >>> >>> Total Number of entries: 27 > >>> >>> Number of entries in heal pending: 27 > >>> >>> Number of entries in split-brain: 0 > >>> >>> Number of entries possibly healing: 0 > >>> >>> > >>> >>> Brick forge:/mnt/forge_block1/apkmirror_data1 > >>> >>> Status: Connected > >>> >>> Total Number of entries: 27 > >>> >>> Number of entries in heal pending: 27 > >>> >>> Number of entries in split-brain: 0 > >>> >>> Number of entries possibly healing: 0 > >>> >>> > >>> >>> Brick hive:/mnt/hive_block1/apkmirror_data1 > >>> >>> Status: Connected > >>> >>> Total Number of entries: 27 > >>> >>> Number of entries in heal pending: 27 > >>> >>> Number of entries in split-brain: 0 > >>> >>> Number of entries possibly healing: 0 > >>> >>> > >>> >>> Brick citadel:/mnt/citadel_block1/apkmirror_data1 > >>> >>> Status: Connected > >>> >>> Total Number of entries: 8540 > >>> >>> Number of entries in heal pending: 8540 > >>> >>> Number of entries in split-brain: 0 > >>> >>> Number of entries possibly healing: 0 > >>> >>> > >>> >>> > >>> >>> > >>> >>> gluster v heal androidpolice_data3 info summary > >>> >>> Brick nexus2:/mnt/nexus2_block4/androidpolice_data3 > >>> >>> Status: Connected > >>> >>> Total Number of entries: 1 > >>> >>> Number of entries in heal pending: 1 > >>> >>> Number of entries in split-brain: 0 > >>> >>> Number of entries possibly healing: 0 > >>> >>> > >>> >>> Brick forge:/mnt/forge_block4/androidpolice_data3 > >>> >>> Status: Connected > >>> >>> Total Number of entries: 1 > >>> >>> Number of entries in heal pending: 1 > >>> >>> Number of entries in split-brain: 0 > >>> >>> Number of entries possibly healing: 0 > >>> >>> > >>> >>> Brick hive:/mnt/hive_block4/androidpolice_data3 > >>> >>> Status: Connected > >>> >>> Total Number of entries: 1 > >>> >>> Number of entries in heal pending: 1 > >>> >>> Number of entries in split-brain: 0 > >>> >>> Number of entries possibly healing: 0 > >>> >>> > >>> >>> Brick citadel:/mnt/citadel_block4/androidpolice_data3 > >>> >>> Status: Connected > >>> >>> Total Number of entries: 1149 > >>> >>> Number of entries in heal pending: 1149 > >>> >>> Number of entries in split-brain: 0 > >>> >>> Number of entries possibly healing: 0 > >>> >>> > >>> >>> > >>> >>> What should I do at this point? The files I tested seem to be > >>> >replicating > >>> >>> correctly, but I don't know if it's the case for all of them, > >and > >>> >the heals > >>> >>> going up and down, and all these log messages are making me very > >>> >nervous. > >>> >>> > >>> >>> Thank you. > >>> >>> > >>> >>> Sincerely, > >>> >>> Artem > >>> >>> > >>> >>> -- > >>> >>> Founder, Android Police <http://www.androidpolice.com>, APK > >Mirror > >>> >>> <http://www.apkmirror.com/>, Illogical Robot LLC > >>> >>> beerpla.net | @ArtemR <http://twitter.com/ArtemR> > >>> >>> > >>> >> > >>> > >>> I's not supported , but usually it works. > >>> > >>> In worst case scenario, you can remove the node, wipe gluster on > >the > >>> node, reinstall the packages and add it - it will require full heal > >of the > >>> brick and as you have previously reported could lead to performance > >>> degradation. > >>> > >>> I think you are on SLES, but I could be wrong . Do you have btrfs or > >LVM > >>> snapshots to revert from ? > >>> > >>> Best Regards, > >>> Strahil Nikolov > >>> > >> > > Hi Artem, > > You can increase the brick log level following > https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level > but keep in mind that logs grow quite fast - so don't keep them above the > current level for too much time. > > > Do you have a geo replication running ? > > About the migration issue - I have no clue why this happened. Last time I > skipped a major release(3.12 to 5.5) I got a huge trouble (all files > ownership was switched to root) and I have the feeling that it won't > happen again if you go through v6. > > Best Regards, > Strahil Nikolov >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20200502/a37ed1d0/attachment.html>
Strahil Nikolov
2020-May-03 05:52 UTC
[Gluster-users] Upgrade from 5.13 to 7.5 full of weird messages
On May 2, 2020 8:18:38 PM GMT+03:00, Artem Russakovskii <archon810 at gmail.com> wrote:>I don't have geo replication. > >Still waiting for someone from the gluster team to chime in. They used >to >be a lot more responsive here. Do you know if there is a holiday >perhaps, >or have the working hours been cut due to Coronavirus currently? > >I'm not inclined to try a v6 upgrade without their word first. > >On Sat, May 2, 2020, 12:47 AM Strahil Nikolov <hunter86_bg at yahoo.com> >wrote: > >> On May 1, 2020 8:03:50 PM GMT+03:00, Artem Russakovskii < >> archon810 at gmail.com> wrote: >> >The good news is the downgrade seems to have worked and was >painless. >> > >> >zypper install --oldpackage glusterfs-5.13, restart gluster, and >almost >> >immediately there are no heal pending entries anymore. >> > >> >The only things still showing up in the logs, besides some healing >is >> >0-glusterfs-fuse: >> >writing to fuse device failed: No such file or directory: >> >==> mnt-androidpolice_data3.log <=>> >[2020-05-01 16:54:21.085643] E >> >[fuse-bridge.c:219:check_and_dump_fuse_W] >> >(--> >> >>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] >> >(--> >> >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] >> >(--> >> >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] >> >(--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> >> >/lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse: >> >writing to fuse device failed: No such file or directory >> >==> mnt-apkmirror_data1.log <=>> >[2020-05-01 16:54:21.268842] E >> >[fuse-bridge.c:219:check_and_dump_fuse_W] >> >(--> >> >>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fdf2b0a624d] >> >(--> >> >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fdf2748949a] >> >(--> >> >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fdf274897bb] >> >(--> /lib64/libpthread.so.0(+0x84f9)[0x7fdf2a5f64f9] (--> >> >/lib64/libc.so.6(clone+0x3f)[0x7fdf2a32ef2f] ))))) 0-glusterfs-fuse: >> >writing to fuse device failed: No such file or directory >> > >> >It'd be very helpful if it had more info about what failed to write >and >> >why. >> > >> >I'd still really love to see the analysis of this failed upgrade >from >> >core >> >gluster maintainers to see what needs fixing and how we can upgrade >in >> >the >> >future. >> > >> >Thanks. >> > >> >Sincerely, >> >Artem >> > >> >-- >> >Founder, Android Police <http://www.androidpolice.com>, APK Mirror >> ><http://www.apkmirror.com/>, Illogical Robot LLC >> >beerpla.net | @ArtemR <http://twitter.com/ArtemR> >> > >> > >> >On Fri, May 1, 2020 at 7:25 AM Artem Russakovskii ><archon810 at gmail.com> >> >wrote: >> > >> >> I do not have snapshots, no. I have a general file based backup, >but >> >also >> >> the other 3 nodes are up. >> >> >> >> OpenSUSE 15.1. >> >> >> >> If I try to downgrade and it doesn't work, what's the brick >> >replacement >> >> scenario - is this still accurate? >> >> >> > >> >https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-brick >> >> >> >> Any feedback about the issues themselves yet please? Specifically, >is >> >> there a chance this is happening because of the mismatched gluster >> >> versions? Though, what's the solution then? >> >> >> >> On Fri, May 1, 2020, 1:07 AM Strahil Nikolov ><hunter86_bg at yahoo.com> >> >> wrote: >> >> >> >>> On May 1, 2020 1:25:17 AM GMT+03:00, Artem Russakovskii < >> >>> archon810 at gmail.com> wrote: >> >>> >If more time is needed to analyze this, is this an option? Shut >> >down >> >>> >7.5, >> >>> >downgrade it back to 5.13 and restart, or would this screw >> >something up >> >>> >badly? I didn't up the op-version yet. >> >>> > >> >>> >Thanks. >> >>> > >> >>> >Sincerely, >> >>> >Artem >> >>> > >> >>> >-- >> >>> >Founder, Android Police <http://www.androidpolice.com>, APK >Mirror >> >>> ><http://www.apkmirror.com/>, Illogical Robot LLC >> >>> >beerpla.net | @ArtemR <http://twitter.com/ArtemR> >> >>> > >> >>> > >> >>> >On Thu, Apr 30, 2020 at 3:13 PM Artem Russakovskii >> >>> ><archon810 at gmail.com> >> >>> >wrote: >> >>> > >> >>> >> The number of heal pending on citadel, the one that was >upgraded >> >to >> >>> >7.5, >> >>> >> has now gone to 10s of thousands and continues to go up. >> >>> >> >> >>> >> Sincerely, >> >>> >> Artem >> >>> >> >> >>> >> -- >> >>> >> Founder, Android Police <http://www.androidpolice.com>, APK >> >Mirror >> >>> >> <http://www.apkmirror.com/>, Illogical Robot LLC >> >>> >> beerpla.net | @ArtemR <http://twitter.com/ArtemR> >> >>> >> >> >>> >> >> >>> >> On Thu, Apr 30, 2020 at 2:57 PM Artem Russakovskii >> >>> ><archon810 at gmail.com> >> >>> >> wrote: >> >>> >> >> >>> >>> Hi all, >> >>> >>> >> >>> >>> Today, I decided to upgrade one of the four servers (citadel) >we >> >>> >have to >> >>> >>> 7.5 from 5.13. There are 2 volumes, 1x4 replicate, and fuse >> >mounts >> >>> >(I sent >> >>> >>> the full details earlier in another message). If everything >> >looked >> >>> >OK, I >> >>> >>> would have proceeded the rolling upgrade for all of them, >> >following >> >>> >the >> >>> >>> full heal. >> >>> >>> >> >>> >>> However, as soon as I upgraded and restarted, the logs filled >> >with >> >>> >>> messages like these: >> >>> >>> >> >>> >>> [2020-04-30 21:39:21.316149] E >> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc >actor >> >>> >>> (1298437:400:17) failed to complete successfully >> >>> >>> [2020-04-30 21:39:21.382891] E >> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc >actor >> >>> >>> (1298437:400:17) failed to complete successfully >> >>> >>> [2020-04-30 21:39:21.442440] E >> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc >actor >> >>> >>> (1298437:400:17) failed to complete successfully >> >>> >>> [2020-04-30 21:39:21.445587] E >> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc >actor >> >>> >>> (1298437:400:17) failed to complete successfully >> >>> >>> [2020-04-30 21:39:21.571398] E >> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc >actor >> >>> >>> (1298437:400:17) failed to complete successfully >> >>> >>> [2020-04-30 21:39:21.668192] E >> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc >actor >> >>> >>> (1298437:400:17) failed to complete successfully >> >>> >>> >> >>> >>> >> >>> >>> The message "I [MSGID: 108031] >> >>> >>> [afr-common.c:2581:afr_local_discovery_cbk] >> >>> >>> 0-androidpolice_data3-replicate-0: selecting local read_child >> >>> >>> androidpolice_data3-client-3" repeated 10 times between >> >[2020-04-30 >> >>> >>> 21:46:41.854675] and [2020-04-30 21:48:20.206323] >> >>> >>> The message "W [MSGID: 114031] >> >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] >> >>> >>> 0-androidpolice_data3-client-1: remote operation failed >> >[Transport >> >>> >endpoint >> >>> >>> is not connected]" repeated 264 times between [2020-04-30 >> >>> >21:46:32.129567] >> >>> >>> and [2020-04-30 21:48:29.905008] >> >>> >>> The message "W [MSGID: 114031] >> >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] >> >>> >>> 0-androidpolice_data3-client-0: remote operation failed >> >[Transport >> >>> >endpoint >> >>> >>> is not connected]" repeated 264 times between [2020-04-30 >> >>> >21:46:32.129602] >> >>> >>> and [2020-04-30 21:48:29.905040] >> >>> >>> The message "W [MSGID: 114031] >> >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] >> >>> >>> 0-androidpolice_data3-client-2: remote operation failed >> >[Transport >> >>> >endpoint >> >>> >>> is not connected]" repeated 264 times between [2020-04-30 >> >>> >21:46:32.129512] >> >>> >>> and [2020-04-30 21:48:29.905047] >> >>> >>> >> >>> >>> >> >>> >>> >> >>> >>> Once in a while, I'm seeing this: >> >>> >>> ==> bricks/mnt-hive_block4-androidpolice_data3.log <=>> >>> >>> [2020-04-30 21:45:54.251637] I [MSGID: 115072] >> >>> >>> [server-rpc-fops_v2.c:1681:server4_setattr_cbk] >> >>> >>> 0-androidpolice_data3-server: 5725811: SETATTR / >> >>> >>> >> >>> > >> >>> >> > >> >androidpolice.com/public/wp-content/uploads/2019/03/cielo-breez-plus-hero.png >> >>> >>> (d4556eb4-f15b-412c-a42a-32b4438af557), client: >> >>> >>> >> >>> >> >>> >> >> >>>CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-2-RECON_NO:-1, >> >>> >>> error-xlator: androidpolice_data3-access-control [Operation >not >> >>> >permitted] >> >>> >>> [2020-04-30 21:49:10.439701] I [MSGID: 115072] >> >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] >> >>> >>> 0-androidpolice_data3-server: 201833: SETATTR / >> >>> >>> androidpolice.com/public/wp-content/uploads >> >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: >> >>> >>> >> >>> >> >>> >> >> >>>CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, >> >>> >>> error-xlator: androidpolice_data3-access-control [Operation >not >> >>> >permitted] >> >>> >>> [2020-04-30 21:49:10.453724] I [MSGID: 115072] >> >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] >> >>> >>> 0-androidpolice_data3-server: 201842: SETATTR / >> >>> >>> androidpolice.com/public/wp-content/uploads >> >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: >> >>> >>> >> >>> >> >>> >> >> >>>CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, >> >>> >>> error-xlator: androidpolice_data3-access-control [Operation >not >> >>> >permitted] >> >>> >>> [2020-04-30 21:49:16.224662] I [MSGID: 115072] >> >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] >> >>> >>> 0-androidpolice_data3-server: 202865: SETATTR / >> >>> >>> androidpolice.com/public/wp-content/uploads >> >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: >> >>> >>> >> >>> >> >>> >> >> >>>CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, >> >>> >>> error-xlator: androidpolice_data3-access-control [Operation >not >> >>> >permitted] >> >>> >>> >> >>> >>> There's also lots of self-healing happening that I didn't >expect >> >at >> >>> >all, >> >>> >>> since the upgrade only took ~10-15s. >> >>> >>> [2020-04-30 21:47:38.714448] I [MSGID: 108026] >> >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] >> >>> >>> 0-apkmirror_data1-replicate-0: performing metadata selfheal >on >> >>> >>> 4a6ba2d7-7ad8-4113-862b-02e4934a3461 >> >>> >>> [2020-04-30 21:47:38.765033] I [MSGID: 108026] >> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] >> >>> >>> 0-apkmirror_data1-replicate-0: Completed metadata selfheal on >> >>> >>> 4a6ba2d7-7ad8-4113-862b-02e4934a3461. sources=[3] sinks=0 1 >2 >> >>> >>> [2020-04-30 21:47:38.765289] I [MSGID: 108026] >> >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] >> >>> >>> 0-apkmirror_data1-replicate-0: performing metadata selfheal >on >> >>> >>> f3c62a41-1864-4e75-9883-4357a7091296 >> >>> >>> [2020-04-30 21:47:38.800987] I [MSGID: 108026] >> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] >> >>> >>> 0-apkmirror_data1-replicate-0: Completed metadata selfheal on >> >>> >>> f3c62a41-1864-4e75-9883-4357a7091296. sources=[3] sinks=0 1 >2 >> >>> >>> >> >>> >>> >> >>> >>> I'm also seeing "remote operation failed" and "writing to >fuse >> >>> >device >> >>> >>> failed: No such file or directory" messages >> >>> >>> [2020-04-30 21:46:34.891957] I [MSGID: 108026] >> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] >> >>> >>> 0-androidpolice_data3-replicate-0: Completed metadata >selfheal >> >on >> >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] >sinks=3 >> >>> >>> [2020-04-30 21:45:36.127412] W [MSGID: 114031] >> >>> >>> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] >> >>> >>> 0-androidpolice_data3-client-0: remote operation failed >> >[Operation >> >>> >not >> >>> >>> permitted] >> >>> >>> [2020-04-30 21:45:36.345924] W [MSGID: 114031] >> >>> >>> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] >> >>> >>> 0-androidpolice_data3-client-1: remote operation failed >> >[Operation >> >>> >not >> >>> >>> permitted] >> >>> >>> [2020-04-30 21:46:35.291853] I [MSGID: 108031] >> >>> >>> [afr-common.c:2543:afr_local_discovery_cbk] >> >>> >>> 0-androidpolice_data3-replicate-0: selecting local read_child >> >>> >>> androidpolice_data3-client-2 >> >>> >>> [2020-04-30 21:46:35.977342] I [MSGID: 108026] >> >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] >> >>> >>> 0-androidpolice_data3-replicate-0: performing metadata >selfheal >> >on >> >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591 >> >>> >>> [2020-04-30 21:46:36.006607] I [MSGID: 108026] >> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] >> >>> >>> 0-androidpolice_data3-replicate-0: Completed metadata >selfheal >> >on >> >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] >sinks=3 >> >>> >>> [2020-04-30 21:46:37.245599] E >> >>> >[fuse-bridge.c:219:check_and_dump_fuse_W] >> >>> >>> (--> >> >>> >> >>>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] >> >>> >>> (--> >> >>> >>> >> >>> >> >>>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] >> >>> >>> (--> >> >>> >>> >> >>> >> >>>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] >> >>> >>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> >> >>> >>> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) >> >0-glusterfs-fuse: >> >>> >>> writing to fuse device failed: No such file or directory >> >>> >>> [2020-04-30 21:46:50.864797] E >> >>> >[fuse-bridge.c:219:check_and_dump_fuse_W] >> >>> >>> (--> >> >>> >> >>>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] >> >>> >>> (--> >> >>> >>> >> >>> >> >>>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] >> >>> >>> (--> >> >>> >>> >> >>> >> >>>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] >> >>> >>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> >> >>> >>> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) >> >0-glusterfs-fuse: >> >>> >>> writing to fuse device failed: No such file or directory >> >>> >>> >> >>> >>> The number of items being healed is going up and down wildly, >> >from 0 >> >>> >to >> >>> >>> 8000+ and sometimes taking a really long time to return a >value. >> >I'm >> >>> >really >> >>> >>> worried as this is a production system, and I didn't observe >> >this in >> >>> >our >> >>> >>> test system. >> >>> >>> >> >>> >>> >> >>> >>> >> >>> >>> gluster v heal apkmirror_data1 info summary >> >>> >>> Brick nexus2:/mnt/nexus2_block1/apkmirror_data1 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 27 >> >>> >>> Number of entries in heal pending: 27 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> Brick forge:/mnt/forge_block1/apkmirror_data1 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 27 >> >>> >>> Number of entries in heal pending: 27 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> Brick hive:/mnt/hive_block1/apkmirror_data1 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 27 >> >>> >>> Number of entries in heal pending: 27 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> Brick citadel:/mnt/citadel_block1/apkmirror_data1 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 8540 >> >>> >>> Number of entries in heal pending: 8540 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> >> >>> >>> >> >>> >>> gluster v heal androidpolice_data3 info summary >> >>> >>> Brick nexus2:/mnt/nexus2_block4/androidpolice_data3 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 1 >> >>> >>> Number of entries in heal pending: 1 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> Brick forge:/mnt/forge_block4/androidpolice_data3 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 1 >> >>> >>> Number of entries in heal pending: 1 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> Brick hive:/mnt/hive_block4/androidpolice_data3 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 1 >> >>> >>> Number of entries in heal pending: 1 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> Brick citadel:/mnt/citadel_block4/androidpolice_data3 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 1149 >> >>> >>> Number of entries in heal pending: 1149 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> >> >>> >>> What should I do at this point? The files I tested seem to be >> >>> >replicating >> >>> >>> correctly, but I don't know if it's the case for all of them, >> >and >> >>> >the heals >> >>> >>> going up and down, and all these log messages are making me >very >> >>> >nervous. >> >>> >>> >> >>> >>> Thank you. >> >>> >>> >> >>> >>> Sincerely, >> >>> >>> Artem >> >>> >>> >> >>> >>> -- >> >>> >>> Founder, Android Police <http://www.androidpolice.com>, APK >> >Mirror >> >>> >>> <http://www.apkmirror.com/>, Illogical Robot LLC >> >>> >>> beerpla.net | @ArtemR <http://twitter.com/ArtemR> >> >>> >>> >> >>> >> >> >>> >> >>> I's not supported , but usually it works. >> >>> >> >>> In worst case scenario, you can remove the node, wipe gluster on >> >the >> >>> node, reinstall the packages and add it - it will require full >heal >> >of the >> >>> brick and as you have previously reported could lead to >performance >> >>> degradation. >> >>> >> >>> I think you are on SLES, but I could be wrong . Do you have btrfs >or >> >LVM >> >>> snapshots to revert from ? >> >>> >> >>> Best Regards, >> >>> Strahil Nikolov >> >>> >> >> >> >> Hi Artem, >> >> You can increase the brick log level following >> >https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level >> but keep in mind that logs grow quite fast - so don't keep them above >the >> current level for too much time. >> >> >> Do you have a geo replication running ? >> >> About the migration issue - I have no clue why this happened. Last >time I >> skipped a major release(3.12 to 5.5) I got a huge trouble (all files >> ownership was switched to root) and I have the feeling that it >won't >> happen again if you go through v6. >> >> Best Regards, >> Strahil Nikolov >>Hi Artem, 1st of May is an international holliday , while 6th of May is a holliday for some contries. I guess they will join on monday. Best Regards, Strahil Nikolov
Amar Tumballi
2020-May-04 13:26 UTC
[Gluster-users] Upgrade from 5.13 to 7.5 full of weird messages
On Sat, May 2, 2020 at 10:49 PM Artem Russakovskii <archon810 at gmail.com> wrote:> I don't have geo replication. > > Still waiting for someone from the gluster team to chime in. They used to > be a lot more responsive here. Do you know if there is a holiday perhaps, > or have the working hours been cut due to Coronavirus currently? > >It was Holiday on May 1st, and 2nd and 3rd were Weekend days! And also I guess many of Developers from Red Hat were attending Virtual Summit!> I'm not inclined to try a v6 upgrade without their word first. >Fair bet! I will bring this topic in one of the community meetings, and ask developers if they have some feedback! I personally have not seen these errors, and don't have a hunch on which patch would have caused an increase in logs! -Amar> > On Sat, May 2, 2020, 12:47 AM Strahil Nikolov <hunter86_bg at yahoo.com> > wrote: > >> On May 1, 2020 8:03:50 PM GMT+03:00, Artem Russakovskii < >> archon810 at gmail.com> wrote: >> >The good news is the downgrade seems to have worked and was painless. >> > >> >zypper install --oldpackage glusterfs-5.13, restart gluster, and almost >> >immediately there are no heal pending entries anymore. >> > >> >The only things still showing up in the logs, besides some healing is >> >0-glusterfs-fuse: >> >writing to fuse device failed: No such file or directory: >> >==> mnt-androidpolice_data3.log <=>> >[2020-05-01 16:54:21.085643] E >> >[fuse-bridge.c:219:check_and_dump_fuse_W] >> >(--> >> >/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] >> >(--> >> >/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] >> >(--> >> >/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] >> >(--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> >> >/lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse: >> >writing to fuse device failed: No such file or directory >> >==> mnt-apkmirror_data1.log <=>> >[2020-05-01 16:54:21.268842] E >> >[fuse-bridge.c:219:check_and_dump_fuse_W] >> >(--> >> >/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fdf2b0a624d] >> >(--> >> >/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fdf2748949a] >> >(--> >> >/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fdf274897bb] >> >(--> /lib64/libpthread.so.0(+0x84f9)[0x7fdf2a5f64f9] (--> >> >/lib64/libc.so.6(clone+0x3f)[0x7fdf2a32ef2f] ))))) 0-glusterfs-fuse: >> >writing to fuse device failed: No such file or directory >> > >> >It'd be very helpful if it had more info about what failed to write and >> >why. >> > >> >I'd still really love to see the analysis of this failed upgrade from >> >core >> >gluster maintainers to see what needs fixing and how we can upgrade in >> >the >> >future. >> > >> >Thanks. >> > >> >Sincerely, >> >Artem >> > >> >-- >> >Founder, Android Police <http://www.androidpolice.com>, APK Mirror >> ><http://www.apkmirror.com/>, Illogical Robot LLC >> >beerpla.net | @ArtemR <http://twitter.com/ArtemR> >> > >> > >> >On Fri, May 1, 2020 at 7:25 AM Artem Russakovskii <archon810 at gmail.com> >> >wrote: >> > >> >> I do not have snapshots, no. I have a general file based backup, but >> >also >> >> the other 3 nodes are up. >> >> >> >> OpenSUSE 15.1. >> >> >> >> If I try to downgrade and it doesn't work, what's the brick >> >replacement >> >> scenario - is this still accurate? >> >> >> > >> https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-brick >> >> >> >> Any feedback about the issues themselves yet please? Specifically, is >> >> there a chance this is happening because of the mismatched gluster >> >> versions? Though, what's the solution then? >> >> >> >> On Fri, May 1, 2020, 1:07 AM Strahil Nikolov <hunter86_bg at yahoo.com> >> >> wrote: >> >> >> >>> On May 1, 2020 1:25:17 AM GMT+03:00, Artem Russakovskii < >> >>> archon810 at gmail.com> wrote: >> >>> >If more time is needed to analyze this, is this an option? Shut >> >down >> >>> >7.5, >> >>> >downgrade it back to 5.13 and restart, or would this screw >> >something up >> >>> >badly? I didn't up the op-version yet. >> >>> > >> >>> >Thanks. >> >>> > >> >>> >Sincerely, >> >>> >Artem >> >>> > >> >>> >-- >> >>> >Founder, Android Police <http://www.androidpolice.com>, APK Mirror >> >>> ><http://www.apkmirror.com/>, Illogical Robot LLC >> >>> >beerpla.net | @ArtemR <http://twitter.com/ArtemR> >> >>> > >> >>> > >> >>> >On Thu, Apr 30, 2020 at 3:13 PM Artem Russakovskii >> >>> ><archon810 at gmail.com> >> >>> >wrote: >> >>> > >> >>> >> The number of heal pending on citadel, the one that was upgraded >> >to >> >>> >7.5, >> >>> >> has now gone to 10s of thousands and continues to go up. >> >>> >> >> >>> >> Sincerely, >> >>> >> Artem >> >>> >> >> >>> >> -- >> >>> >> Founder, Android Police <http://www.androidpolice.com>, APK >> >Mirror >> >>> >> <http://www.apkmirror.com/>, Illogical Robot LLC >> >>> >> beerpla.net | @ArtemR <http://twitter.com/ArtemR> >> >>> >> >> >>> >> >> >>> >> On Thu, Apr 30, 2020 at 2:57 PM Artem Russakovskii >> >>> ><archon810 at gmail.com> >> >>> >> wrote: >> >>> >> >> >>> >>> Hi all, >> >>> >>> >> >>> >>> Today, I decided to upgrade one of the four servers (citadel) we >> >>> >have to >> >>> >>> 7.5 from 5.13. There are 2 volumes, 1x4 replicate, and fuse >> >mounts >> >>> >(I sent >> >>> >>> the full details earlier in another message). If everything >> >looked >> >>> >OK, I >> >>> >>> would have proceeded the rolling upgrade for all of them, >> >following >> >>> >the >> >>> >>> full heal. >> >>> >>> >> >>> >>> However, as soon as I upgraded and restarted, the logs filled >> >with >> >>> >>> messages like these: >> >>> >>> >> >>> >>> [2020-04-30 21:39:21.316149] E >> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor >> >>> >>> (1298437:400:17) failed to complete successfully >> >>> >>> [2020-04-30 21:39:21.382891] E >> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor >> >>> >>> (1298437:400:17) failed to complete successfully >> >>> >>> [2020-04-30 21:39:21.442440] E >> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor >> >>> >>> (1298437:400:17) failed to complete successfully >> >>> >>> [2020-04-30 21:39:21.445587] E >> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor >> >>> >>> (1298437:400:17) failed to complete successfully >> >>> >>> [2020-04-30 21:39:21.571398] E >> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor >> >>> >>> (1298437:400:17) failed to complete successfully >> >>> >>> [2020-04-30 21:39:21.668192] E >> >>> >>> [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor >> >>> >>> (1298437:400:17) failed to complete successfully >> >>> >>> >> >>> >>> >> >>> >>> The message "I [MSGID: 108031] >> >>> >>> [afr-common.c:2581:afr_local_discovery_cbk] >> >>> >>> 0-androidpolice_data3-replicate-0: selecting local read_child >> >>> >>> androidpolice_data3-client-3" repeated 10 times between >> >[2020-04-30 >> >>> >>> 21:46:41.854675] and [2020-04-30 21:48:20.206323] >> >>> >>> The message "W [MSGID: 114031] >> >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] >> >>> >>> 0-androidpolice_data3-client-1: remote operation failed >> >[Transport >> >>> >endpoint >> >>> >>> is not connected]" repeated 264 times between [2020-04-30 >> >>> >21:46:32.129567] >> >>> >>> and [2020-04-30 21:48:29.905008] >> >>> >>> The message "W [MSGID: 114031] >> >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] >> >>> >>> 0-androidpolice_data3-client-0: remote operation failed >> >[Transport >> >>> >endpoint >> >>> >>> is not connected]" repeated 264 times between [2020-04-30 >> >>> >21:46:32.129602] >> >>> >>> and [2020-04-30 21:48:29.905040] >> >>> >>> The message "W [MSGID: 114031] >> >>> >>> [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] >> >>> >>> 0-androidpolice_data3-client-2: remote operation failed >> >[Transport >> >>> >endpoint >> >>> >>> is not connected]" repeated 264 times between [2020-04-30 >> >>> >21:46:32.129512] >> >>> >>> and [2020-04-30 21:48:29.905047] >> >>> >>> >> >>> >>> >> >>> >>> >> >>> >>> Once in a while, I'm seeing this: >> >>> >>> ==> bricks/mnt-hive_block4-androidpolice_data3.log <=>> >>> >>> [2020-04-30 21:45:54.251637] I [MSGID: 115072] >> >>> >>> [server-rpc-fops_v2.c:1681:server4_setattr_cbk] >> >>> >>> 0-androidpolice_data3-server: 5725811: SETATTR / >> >>> >>> >> >>> > >> >>> >> > >> androidpolice.com/public/wp-content/uploads/2019/03/cielo-breez-plus-hero.png >> >>> >>> (d4556eb4-f15b-412c-a42a-32b4438af557), client: >> >>> >>> >> >>> >> >>> >> >> >>CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-2-RECON_NO:-1, >> >>> >>> error-xlator: androidpolice_data3-access-control [Operation not >> >>> >permitted] >> >>> >>> [2020-04-30 21:49:10.439701] I [MSGID: 115072] >> >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] >> >>> >>> 0-androidpolice_data3-server: 201833: SETATTR / >> >>> >>> androidpolice.com/public/wp-content/uploads >> >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: >> >>> >>> >> >>> >> >>> >> >> >>CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, >> >>> >>> error-xlator: androidpolice_data3-access-control [Operation not >> >>> >permitted] >> >>> >>> [2020-04-30 21:49:10.453724] I [MSGID: 115072] >> >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] >> >>> >>> 0-androidpolice_data3-server: 201842: SETATTR / >> >>> >>> androidpolice.com/public/wp-content/uploads >> >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: >> >>> >>> >> >>> >> >>> >> >> >>CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, >> >>> >>> error-xlator: androidpolice_data3-access-control [Operation not >> >>> >permitted] >> >>> >>> [2020-04-30 21:49:16.224662] I [MSGID: 115072] >> >>> >>> [server-rpc-fops_v2.c:1680:server4_setattr_cbk] >> >>> >>> 0-androidpolice_data3-server: 202865: SETATTR / >> >>> >>> androidpolice.com/public/wp-content/uploads >> >>> >>> (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: >> >>> >>> >> >>> >> >>> >> >> >>CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, >> >>> >>> error-xlator: androidpolice_data3-access-control [Operation not >> >>> >permitted] >> >>> >>> >> >>> >>> There's also lots of self-healing happening that I didn't expect >> >at >> >>> >all, >> >>> >>> since the upgrade only took ~10-15s. >> >>> >>> [2020-04-30 21:47:38.714448] I [MSGID: 108026] >> >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] >> >>> >>> 0-apkmirror_data1-replicate-0: performing metadata selfheal on >> >>> >>> 4a6ba2d7-7ad8-4113-862b-02e4934a3461 >> >>> >>> [2020-04-30 21:47:38.765033] I [MSGID: 108026] >> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] >> >>> >>> 0-apkmirror_data1-replicate-0: Completed metadata selfheal on >> >>> >>> 4a6ba2d7-7ad8-4113-862b-02e4934a3461. sources=[3] sinks=0 1 2 >> >>> >>> [2020-04-30 21:47:38.765289] I [MSGID: 108026] >> >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] >> >>> >>> 0-apkmirror_data1-replicate-0: performing metadata selfheal on >> >>> >>> f3c62a41-1864-4e75-9883-4357a7091296 >> >>> >>> [2020-04-30 21:47:38.800987] I [MSGID: 108026] >> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] >> >>> >>> 0-apkmirror_data1-replicate-0: Completed metadata selfheal on >> >>> >>> f3c62a41-1864-4e75-9883-4357a7091296. sources=[3] sinks=0 1 2 >> >>> >>> >> >>> >>> >> >>> >>> I'm also seeing "remote operation failed" and "writing to fuse >> >>> >device >> >>> >>> failed: No such file or directory" messages >> >>> >>> [2020-04-30 21:46:34.891957] I [MSGID: 108026] >> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] >> >>> >>> 0-androidpolice_data3-replicate-0: Completed metadata selfheal >> >on >> >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3 >> >>> >>> [2020-04-30 21:45:36.127412] W [MSGID: 114031] >> >>> >>> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] >> >>> >>> 0-androidpolice_data3-client-0: remote operation failed >> >[Operation >> >>> >not >> >>> >>> permitted] >> >>> >>> [2020-04-30 21:45:36.345924] W [MSGID: 114031] >> >>> >>> [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] >> >>> >>> 0-androidpolice_data3-client-1: remote operation failed >> >[Operation >> >>> >not >> >>> >>> permitted] >> >>> >>> [2020-04-30 21:46:35.291853] I [MSGID: 108031] >> >>> >>> [afr-common.c:2543:afr_local_discovery_cbk] >> >>> >>> 0-androidpolice_data3-replicate-0: selecting local read_child >> >>> >>> androidpolice_data3-client-2 >> >>> >>> [2020-04-30 21:46:35.977342] I [MSGID: 108026] >> >>> >>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] >> >>> >>> 0-androidpolice_data3-replicate-0: performing metadata selfheal >> >on >> >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591 >> >>> >>> [2020-04-30 21:46:36.006607] I [MSGID: 108026] >> >>> >>> [afr-self-heal-common.c:1723:afr_log_selfheal] >> >>> >>> 0-androidpolice_data3-replicate-0: Completed metadata selfheal >> >on >> >>> >>> 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3 >> >>> >>> [2020-04-30 21:46:37.245599] E >> >>> >[fuse-bridge.c:219:check_and_dump_fuse_W] >> >>> >>> (--> >> >>> >> >>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] >> >>> >>> (--> >> >>> >>> >> >>> >> >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] >> >>> >>> (--> >> >>> >>> >> >>> >> >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] >> >>> >>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> >> >>> >>> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) >> >0-glusterfs-fuse: >> >>> >>> writing to fuse device failed: No such file or directory >> >>> >>> [2020-04-30 21:46:50.864797] E >> >>> >[fuse-bridge.c:219:check_and_dump_fuse_W] >> >>> >>> (--> >> >>> >> >>/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] >> >>> >>> (--> >> >>> >>> >> >>> >> >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] >> >>> >>> (--> >> >>> >>> >> >>> >> >>/usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] >> >>> >>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> >> >>> >>> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) >> >0-glusterfs-fuse: >> >>> >>> writing to fuse device failed: No such file or directory >> >>> >>> >> >>> >>> The number of items being healed is going up and down wildly, >> >from 0 >> >>> >to >> >>> >>> 8000+ and sometimes taking a really long time to return a value. >> >I'm >> >>> >really >> >>> >>> worried as this is a production system, and I didn't observe >> >this in >> >>> >our >> >>> >>> test system. >> >>> >>> >> >>> >>> >> >>> >>> >> >>> >>> gluster v heal apkmirror_data1 info summary >> >>> >>> Brick nexus2:/mnt/nexus2_block1/apkmirror_data1 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 27 >> >>> >>> Number of entries in heal pending: 27 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> Brick forge:/mnt/forge_block1/apkmirror_data1 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 27 >> >>> >>> Number of entries in heal pending: 27 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> Brick hive:/mnt/hive_block1/apkmirror_data1 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 27 >> >>> >>> Number of entries in heal pending: 27 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> Brick citadel:/mnt/citadel_block1/apkmirror_data1 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 8540 >> >>> >>> Number of entries in heal pending: 8540 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> >> >>> >>> >> >>> >>> gluster v heal androidpolice_data3 info summary >> >>> >>> Brick nexus2:/mnt/nexus2_block4/androidpolice_data3 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 1 >> >>> >>> Number of entries in heal pending: 1 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> Brick forge:/mnt/forge_block4/androidpolice_data3 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 1 >> >>> >>> Number of entries in heal pending: 1 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> Brick hive:/mnt/hive_block4/androidpolice_data3 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 1 >> >>> >>> Number of entries in heal pending: 1 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> Brick citadel:/mnt/citadel_block4/androidpolice_data3 >> >>> >>> Status: Connected >> >>> >>> Total Number of entries: 1149 >> >>> >>> Number of entries in heal pending: 1149 >> >>> >>> Number of entries in split-brain: 0 >> >>> >>> Number of entries possibly healing: 0 >> >>> >>> >> >>> >>> >> >>> >>> What should I do at this point? The files I tested seem to be >> >>> >replicating >> >>> >>> correctly, but I don't know if it's the case for all of them, >> >and >> >>> >the heals >> >>> >>> going up and down, and all these log messages are making me very >> >>> >nervous. >> >>> >>> >> >>> >>> Thank you. >> >>> >>> >> >>> >>> Sincerely, >> >>> >>> Artem >> >>> >>> >> >>> >>> -- >> >>> >>> Founder, Android Police <http://www.androidpolice.com>, APK >> >Mirror >> >>> >>> <http://www.apkmirror.com/>, Illogical Robot LLC >> >>> >>> beerpla.net | @ArtemR <http://twitter.com/ArtemR> >> >>> >>> >> >>> >> >> >>> >> >>> I's not supported , but usually it works. >> >>> >> >>> In worst case scenario, you can remove the node, wipe gluster on >> >the >> >>> node, reinstall the packages and add it - it will require full heal >> >of the >> >>> brick and as you have previously reported could lead to performance >> >>> degradation. >> >>> >> >>> I think you are on SLES, but I could be wrong . Do you have btrfs or >> >LVM >> >>> snapshots to revert from ? >> >>> >> >>> Best Regards, >> >>> Strahil Nikolov >> >>> >> >> >> >> Hi Artem, >> >> You can increase the brick log level following >> https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level >> but keep in mind that logs grow quite fast - so don't keep them above the >> current level for too much time. >> >> >> Do you have a geo replication running ? >> >> About the migration issue - I have no clue why this happened. Last time I >> skipped a major release(3.12 to 5.5) I got a huge trouble (all files >> ownership was switched to root) and I have the feeling that it won't >> happen again if you go through v6. >> >> Best Regards, >> Strahil Nikolov >> > ________ > > > > Community Meeting Calendar: > > Schedule - > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC > Bridge: https://bluejeans.com/441850968 > > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users >-- -- https://kadalu.io Container Storage made easy! -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20200504/6b0cee68/attachment.html>