Nithya Balachandran
2019-Feb-06 08:19 UTC
[Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]
Hi Artem,

Do you still see the crashes with 5.3? If yes, please try mounting the
volume using the mount option lru-limit=0 and see if that helps. We are
looking into the crashes and will update when we have a fix.

Also, please provide the gluster volume info for the volume in question.

regards,
Nithya

On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii <archon810 at gmail.com> wrote:

> The fuse crash happened two more times, but this time monit helped recover
> within 1 minute, so it's a great workaround for now.
>
> What's odd is that the crashes are only happening on one of 4 servers, and
> I don't know why.
>
> Sincerely,
> Artem
>
> --
> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
> <http://www.apkmirror.com/>, Illogical Robot LLC
> beerpla.net | +ArtemRussakovskii <https://plus.google.com/+ArtemRussakovskii>
> | @ArtemR <http://twitter.com/ArtemR>
>
> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii <archon810 at gmail.com>
> wrote:
>
>> The fuse crash happened again yesterday, to another volume. Are there any
>> mount options that could help mitigate this?
>>
>> In the meantime, I set up a monit (https://mmonit.com/monit/) task to
>> watch and restart the mount, which works and recovers the mount point
>> within a minute. Not ideal, but a temporary workaround.
>>
>> By the way, the way to reproduce this "Transport endpoint is not
>> connected" condition for testing purposes is to kill -9 the right
>> "glusterfs --process-name fuse" process.
>>
>> monit check:
>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1
>> start program = "/bin/mount /mnt/glusterfs_data1"
>> stop program = "/bin/umount /mnt/glusterfs_data1"
>> if space usage > 90% for 5 times within 15 cycles
>> then alert else if succeeded for 10 cycles then alert
>>
>> stack trace:
>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fa0249e4329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument]
>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fa0249e4329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument]
>> The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 26 times between [2019-02-01 23:21:20.857333] and [2019-02-01 23:21:56.164427]
>> The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-3" repeated 27 times between [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036]
>> pending frames:
>> frame : type(1) op(LOOKUP)
>> frame : type(0) op(0)
>> patchset: git://git.gluster.org/glusterfs.git
>> signal received: 6
>> time of crash:
>> 2019-02-01 23:22:03
>> configuration details:
>> argp 1
>> backtrace 1
>> dlfcn 1
>> libpthread 1
>> llistxattr 1
>> setfsid 1
>> spinlock 1
>> epoll.h 1
>> xattr.h 1
>> st_atim.tv_nsec 1
>> package-string: glusterfs 5.3
>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c]
>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6]
>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160]
>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0]
>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1]
>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa]
>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772]
>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8]
>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d]
>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1]
>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f]
>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820]
>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f]
>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063]
>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2]
>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3]
>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559]
>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f]
>>
>> Sincerely,
>> Artem
>>
>> --
>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>> <http://www.apkmirror.com/>, Illogical Robot LLC
>> beerpla.net | +ArtemRussakovskii <https://plus.google.com/+ArtemRussakovskii>
>> | @ArtemR <http://twitter.com/ArtemR>
>>
>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii <archon810 at gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> The first (and so far only) crash happened at 2am the next day after we
>>> upgraded, on only one of four servers and only to one of two mounts.
>>>
>>> I have no idea what caused it, but yeah, we do have a pretty busy site
>>> (apkmirror.com), and it caused a disruption for any uploads or downloads
>>> from that server until I woke up and fixed the mount.
>>>
>>> I wish I could be more helpful but all I have is that stack trace.
>>>
>>> I'm glad it's a blocker and will hopefully be resolved soon.
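[A practical aside on the log flooding quoted above: the volume of repeated "dict is NULL" and "Failed to dispatch handler" entries can be measured with plain grep before and after trying a fix. A minimal, self-contained sketch; a real invocation would point LOG at a FUSE client log such as /var/log/glusterfs/mnt-SITE_data1.log (that path is an assumption), but here a sample log is generated inline so the snippet runs anywhere.]

```shell
# Count the two flooding messages in a (sample) FUSE client log.
LOG=$(mktemp)
printf '%s\n' \
  '[2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] (...) 0-dict: dict is NULL [Invalid argument]' \
  '[2019-02-01 23:21:20.857333] E [MSGID: 101191] 0-epoll: Failed to dispatch handler' \
  '[2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] (...) 0-dict: dict is NULL [Invalid argument]' \
  > "$LOG"

grep -c 'dict is NULL' "$LOG"                 # prints 2 for this sample
grep -c 'Failed to dispatch handler' "$LOG"   # prints 1 for this sample
rm -f "$LOG"
```

[Running the same two counts periodically (e.g. from cron) gives a cheap way to confirm whether the log-flooding patch actually quiets things down.]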
>>>
>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan
>>> <atumball at redhat.com> wrote:
>>>
>>>> Hi Artem,
>>>>
>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (i.e. as a
>>>> clone of other bugs where recent discussions happened), and marked it
>>>> as a blocker for the glusterfs-5.4 release.
>>>>
>>>> We already have fixes for the log flooding
>>>> (https://review.gluster.org/22128), and are in the process of
>>>> identifying and fixing the issue seen with the crash.
>>>>
>>>> Can you please tell us whether the crashes happened as soon as you
>>>> upgraded, or was there any particular pattern you observed before the
>>>> crash?
>>>>
>>>> -Amar
>>>>
>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii
>>>> <archon810 at gmail.com> wrote:
>>>>
>>>>> Within 24 hours after updating from rock-solid 4.1 to 5.3, I already
>>>>> got a crash which others have mentioned in
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to
>>>>> unmount, kill gluster, and remount:
>>>>>
>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
>>>>> The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-3" repeated 5 times between [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061]
>>>>> The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 72 times between [2019-01-31 09:37:53.746741] and [2019-01-31 09:38:04.696993]
>>>>> pending frames:
>>>>> frame : type(1) op(READ)
>>>>> frame : type(1) op(OPEN)
>>>>> frame : type(0) op(0)
>>>>> patchset: git://git.gluster.org/glusterfs.git
>>>>> signal received: 6
>>>>> time of crash:
>>>>> 2019-01-31 09:38:04
>>>>> configuration details:
>>>>> argp 1
>>>>> backtrace 1
>>>>> dlfcn 1
>>>>> libpthread 1
>>>>> llistxattr 1
>>>>> setfsid 1
>>>>> spinlock 1
>>>>> epoll.h 1
>>>>> xattr.h 1
>>>>> st_atim.tv_nsec 1
>>>>> package-string: glusterfs 5.3
>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c]
>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6]
>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160]
>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0]
>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1]
>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa]
>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772]
>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8]
>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d]
>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778]
>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820]
>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f]
>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063]
>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2]
>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3]
>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559]
>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f]
>>>>> ---------
>>>>>
>>>>> Do the pending patches fix the crash or only the repeated warnings?
>>>>> I'm running glusterfs on openSUSE 15.0 installed via
>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/,
>>>>> and am not too sure how to make it core dump.
>>>>>
>>>>> If it's not fixed by the patches above, has anyone already opened a
>>>>> ticket for the crashes that I can join and monitor? This is going to
>>>>> create a massive problem for us since production systems are crashing.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Sincerely,
>>>>> Artem
>>>>>
>>>>> --
>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>>>> beerpla.net | +ArtemRussakovskii <https://plus.google.com/+ArtemRussakovskii>
>>>>> | @ArtemR <http://twitter.com/ArtemR>
>>>>>
>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa
>>>>> <rgowdapp at redhat.com> wrote:
>>>>>
>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii
>>>>>> <archon810 at gmail.com> wrote:
>>>>>>
>>>>>>> Also, not sure if related or not, but I got a ton of these "Failed
>>>>>>> to dispatch handler" messages in my logs as well. Many people have
>>>>>>> been commenting about this issue here:
>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246.
>>>>>>
>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this.
>>>>>>
>>>>>>>> ==> mnt-SITE_data1.log <==
>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]
>>>>>>>> ==> mnt-SITE_data3.log <==
>>>>>>>> The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 413 times between [2019-01-30 20:36:23.881090] and [2019-01-30 20:38:20.015593]
>>>>>>>> The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-0" repeated 42 times between [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306]
>>>>>>>> ==> mnt-SITE_data1.log <==
>>>>>>>> The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-0" repeated 50 times between [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789]
>>>>>>>> The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and [2019-01-30 20:38:20.546355]
>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-0
>>>>>>>> ==> mnt-SITE_data3.log <==
>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-0
>>>>>>>> ==> mnt-SITE_data1.log <==
>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler
>>>>>>>
>>>>>>> I'm hoping raising the issue here on the mailing list may bring some
>>>>>>> additional eyeballs and get them both fixed.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> Sincerely,
>>>>>>> Artem
>>>>>>>
>>>>>>> --
>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>>>>>> beerpla.net | +ArtemRussakovskii <https://plus.google.com/+ArtemRussakovskii>
>>>>>>> | @ArtemR <http://twitter.com/ArtemR>
>>>>>>>
>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii
>>>>>>> <archon810 at gmail.com> wrote:
>>>>>>>
>>>>>>>> I found a similar issue here:
>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a
>>>>>>>> comment from 3 days ago from someone else with 5.3 who started
>>>>>>>> seeing the spam.
>>>>>>>>
>>>>>>>> Here's the message that repeats over and over:
>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]
>>>>>>>
>>>>>> +Milind Changire <mchangir at redhat.com> Can you check why this
>>>>>> message is logged and send a fix?
>>>>>>
>>>>>>>> Is there any fix for this issue?
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> Sincerely,
>>>>>>>> Artem
>>>>>>>>
>>>>>>>> --
>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>>>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>>>>>>> beerpla.net | +ArtemRussakovskii <https://plus.google.com/+ArtemRussakovskii>
>>>>>>>> | @ArtemR <http://twitter.com/ArtemR>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Gluster-users mailing list
>>>>>>> Gluster-users at gluster.org
>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>
>>>>> _______________________________________________
>>>>> Gluster-users mailing list
>>>>> Gluster-users at gluster.org
>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>
>>>> --
>>>> Amar Tumballi (amarts)
>>>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20190206/16aee0d3/attachment.html>
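[One open question in the message above is how to get a core dump out of the crashing FUSE client ("not too sure how to make it core dump"). The knobs involved are generic Linux facilities, nothing gluster-specific; a sketch with example paths, where the binary location /usr/sbin/glusterfs and the core directory are assumptions to adapt to your system.]

```shell
# Run as root on the affected node.

# 1) Remove the core-size limit for processes started from this shell.
#    (For a systemd-managed service, set LimitCORE=infinity in the unit
#    instead.)
ulimit -c unlimited

# 2) Tell the kernel where to write core files; %e/%p/%t expand to the
#    executable name, PID, and timestamp of the crashing process.
sysctl -w kernel.core_pattern='/var/tmp/core.%e.%p.%t'

# 3) After the next SIGABRT, open the core in gdb for a symbolized
#    backtrace (install the matching glusterfs debuginfo packages first):
#    gdb /usr/sbin/glusterfs /var/tmp/core.glusterfs.<pid>.<timestamp>
```

[On distributions where systemd-coredump owns kernel.core_pattern, `coredumpctl gdb` after the crash is the simpler route; either way, a symbolized core would give the developers much more than the raw gf_print_trace output above.]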
Artem Russakovskii
2019-Feb-06 18:48 UTC
[Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]
Hi Nithya,

Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing crashes,
and no further releases have been made yet.

volume info:

Type: Replicate
Volume ID: ****SNIP****
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: ****SNIP****
Brick2: ****SNIP****
Brick3: ****SNIP****
Brick4: ****SNIP****
Options Reconfigured:
cluster.quorum-count: 1
cluster.quorum-type: fixed
network.ping-timeout: 5
network.remote-dio: enable
performance.rda-cache-limit: 256MB
performance.readdir-ahead: on
performance.parallel-readdir: on
network.inode-lru-limit: 500000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
cluster.readdir-optimize: on
performance.io-thread-count: 32
server.event-threads: 4
client.event-threads: 4
performance.read-ahead: off
cluster.lookup-optimize: on
performance.cache-size: 1GB
cluster.self-heal-daemon: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on
cluster.granular-entry-heal: enable
cluster.data-self-heal-algorithm: full

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror
<http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | +ArtemRussakovskii <https://plus.google.com/+ArtemRussakovskii>
| @ArtemR <http://twitter.com/ArtemR>

On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran <nbalacha at redhat.com>
wrote:

> Hi Artem,
>
> Do you still see the crashes with 5.3? If yes, please try mounting the
> volume using the mount option lru-limit=0 and see if that helps. We are
> looking into the crashes and will update when we have a fix.
>
> Also, please provide the gluster volume info for the volume in question.
>
> regards,
> Nithya
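[For reference, Nithya's suggested workaround translates to a mount invocation along these lines. A sketch only: the server, volume, and mount-point names are placeholders, and lru-limit is assumed to be accepted as a -o option by the glusterfs 5.x FUSE mount helper, as the thread implies.]

```shell
# Remount the volume with inode LRU reclamation disabled (lru-limit=0).
# "server1:/SITE_data1" and "/mnt/glusterfs_data1" are placeholder names;
# substitute your own volfile server, volume, and mount point.
umount /mnt/glusterfs_data1
mount -t glusterfs -o lru-limit=0 server1:/SITE_data1 /mnt/glusterfs_data1

# Equivalent persistent entry in /etc/fstab:
# server1:/SITE_data1  /mnt/glusterfs_data1  glusterfs  defaults,_netdev,lru-limit=0  0 0
```

[Since the crashes here only hit one of four otherwise identical servers, remounting a single affected client this way is a low-risk way to test whether the inode table is implicated before touching the rest of the fleet.]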