Nithya Balachandran
2019-Feb-08 03:05 UTC
[Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]
Thanks Artem. Can you send us the coredump or the bt with symbols from the crash? Regards, Nithya On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii <archon810 at gmail.com> wrote:> Sorry to disappoint, but the crash just happened again, so lru-limit=0 > didn't help. > > Here's the snippet of the crash and the subsequent remount by monit. > > > [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] > (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) > [0x7f4402b99329] > -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) > [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) > [0x7f440b6b5218] ) 0-dict: dict is NULL [In > valid argument] > The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] > 0-<SNIP>_data1-replicate-0: selecting local read_child > <SNIP>_data1-client-3" repeated 39 times between [2019-02-08 > 01:11:18.043286] and [2019-02-08 01:13:07.915604] > The message "E [MSGID: 101191] > [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch > handler" repeated 515 times between [2019-02-08 01:11:17.932515] and > [2019-02-08 01:13:09.311554] > pending frames: > frame : type(1) op(LOOKUP) > frame : type(0) op(0) > patchset: git://git.gluster.org/glusterfs.git > signal received: 6 > time of crash: > 2019-02-08 01:13:09 > configuration details: > argp 1 > backtrace 1 > dlfcn 1 > libpthread 1 > llistxattr 1 > setfsid 1 > spinlock 1 > epoll.h 1 > xattr.h 1 > st_atim.tv_nsec 1 > package-string: glusterfs 5.3 > /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] > /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] > /lib64/libc.so.6(+0x36160)[0x7f440a887160] > /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] > /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] > /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] > /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] > /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] > > /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] > > /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] > > /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] > /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] > /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] > /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] > /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] > /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] > /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] > --------- > [2019-02-08 01:13:35.628478] I [MSGID: 100030] [glusterfsd.c:2715:main] > 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 > (args: /usr/sbin/glusterfs --lru-limit=0 --process-name fuse > --volfile-server=localhost --volfile-id=/<SNIP>_data1 /mnt/<SNIP>_data1) > [2019-02-08 01:13:35.637830] I [MSGID: 101190] > [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-02-08 01:13:35.651405] I [MSGID: 101190] > [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 2 > [2019-02-08 01:13:35.651628] I [MSGID: 101190] > [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 3 > [2019-02-08 01:13:35.651747] I [MSGID: 101190] > [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 4 > [2019-02-08 01:13:35.652575] I [MSGID: 114020] [client.c:2354:notify] > 0-<SNIP>_data1-client-0: parent 
translators are ready, attempting connect > on transport > [2019-02-08 01:13:35.652978] I [MSGID: 114020] [client.c:2354:notify] > 0-<SNIP>_data1-client-1: parent translators are ready, attempting connect > on transport > [2019-02-08 01:13:35.655197] I [MSGID: 114020] [client.c:2354:notify] > 0-<SNIP>_data1-client-2: parent translators are ready, attempting connect > on transport > [2019-02-08 01:13:35.655497] I [MSGID: 114020] [client.c:2354:notify] > 0-<SNIP>_data1-client-3: parent translators are ready, attempting connect > on transport > [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] > 0-<SNIP>_data1-client-0: changing port to 49153 (from 0) > Final graph: > > > Sincerely, > Artem > > -- > Founder, Android Police <http://www.androidpolice.com>, APK Mirror > <http://www.apkmirror.com/>, Illogical Robot LLC > beerpla.net | +ArtemRussakovskii > <https://plus.google.com/+ArtemRussakovskii> | @ArtemR > <http://twitter.com/ArtemR> > > > On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii <archon810 at gmail.com> > wrote: > >> I've added the lru-limit=0 parameter to the mounts, and I see it's taken >> effect correctly: >> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >> --volfile-server=localhost --volfile-id=/<SNIP> /mnt/<SNIP>" >> >> Let's see if it stops crashing or not. >> >> Sincerely, >> Artem >> >> -- >> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >> <http://www.apkmirror.com/>, Illogical Robot LLC >> beerpla.net | +ArtemRussakovskii >> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >> <http://twitter.com/ArtemR> >> >> >> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii <archon810 at gmail.com> >> wrote: >> >>> Hi Nithya, >>> >>> Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing >>> crashes, and no further releases have been made yet. >>> >>> volume info: >>> Type: Replicate >>> Volume ID: ****SNIP**** >>> Status: Started >>> Snapshot Count: 0 >>> Number of Bricks: 1 x 4 = 4 >>> Transport-type: tcp >>> Bricks: >>> Brick1: ****SNIP**** >>> Brick2: ****SNIP**** >>> Brick3: ****SNIP**** >>> Brick4: ****SNIP**** >>> Options Reconfigured: >>> cluster.quorum-count: 1 >>> cluster.quorum-type: fixed >>> network.ping-timeout: 5 >>> network.remote-dio: enable >>> performance.rda-cache-limit: 256MB >>> performance.readdir-ahead: on >>> performance.parallel-readdir: on >>> network.inode-lru-limit: 500000 >>> performance.md-cache-timeout: 600 >>> performance.cache-invalidation: on >>> performance.stat-prefetch: on >>> features.cache-invalidation-timeout: 600 >>> features.cache-invalidation: on >>> cluster.readdir-optimize: on >>> performance.io-thread-count: 32 >>> server.event-threads: 4 >>> client.event-threads: 4 >>> performance.read-ahead: off >>> cluster.lookup-optimize: on >>> performance.cache-size: 1GB >>> cluster.self-heal-daemon: enable >>> transport.address-family: inet >>> nfs.disable: on >>> performance.client-io-threads: on >>> cluster.granular-entry-heal: enable >>> cluster.data-self-heal-algorithm: full >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>> <http://www.apkmirror.com/>, Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>> <http://twitter.com/ArtemR> >>> >>> >>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran <nbalacha at redhat.com> >>> wrote: >>> >>>> Hi Artem, >>>> >>>> Do you still see the crashes with 5.3? 
If yes, please try mount the >>>> volume using the mount option lru-limit=0 and see if that helps. We are >>>> looking into the crashes and will update when have a fix. >>>> >>>> Also, please provide the gluster volume info for the volume in question. >>>> >>>> >>>> regards, >>>> Nithya >>>> >>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii <archon810 at gmail.com> >>>> wrote: >>>> >>>>> The fuse crash happened two more times, but this time monit helped >>>>> recover within 1 minute, so it's a great workaround for now. >>>>> >>>>> What's odd is that the crashes are only happening on one of 4 servers, >>>>> and I don't know why. >>>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>>>> <http://www.apkmirror.com/>, Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii >>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>> <http://twitter.com/ArtemR> >>>>> >>>>> >>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>> archon810 at gmail.com> wrote: >>>>> >>>>>> The fuse crash happened again yesterday, to another volume. Are there >>>>>> any mount options that could help mitigate this? >>>>>> >>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) task >>>>>> to watch and restart the mount, which works and recovers the mount point >>>>>> within a minute. Not ideal, but a temporary workaround. >>>>>> >>>>>> By the way, the way to reproduce this "Transport endpoint is not >>>>>> connected" condition for testing purposes is to kill -9 the right >>>>>> "glusterfs --process-name fuse" process. >>>>>> >>>>>> >>>>>> monit check: >>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>> then alert else if succeeded for 10 cycles then alert >>>>>> >>>>>> >>>>>> stack trace: >>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>> [0x7fa0249e4329] >>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>> [0x7fa0249e4329] >>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>> The message "E [MSGID: 101191] >>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>> [2019-02-01 23:21:56.164427] >>>>>> The message "I [MSGID: 108031] >>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>> pending frames: >>>>>> frame : type(1) op(LOOKUP) >>>>>> frame : type(0) op(0) >>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>> signal received: 6 >>>>>> time of crash: >>>>>> 2019-02-01 23:22:03 >>>>>> configuration details: >>>>>> argp 1 >>>>>> backtrace 1 >>>>>> dlfcn 1 >>>>>> libpthread 1 
>>>>>> llistxattr 1 >>>>>> setfsid 1 >>>>>> spinlock 1 >>>>>> epoll.h 1 >>>>>> xattr.h 1 >>>>>> st_atim.tv_nsec 1 >>>>>> package-string: glusterfs 5.3 >>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>> <http://twitter.com/ArtemR> >>>>>> >>>>>> >>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>> archon810 at gmail.com> wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> The first (and so far only) crash happened at 2am the next day after >>>>>>> we upgraded, on only one of four servers and only to one of two mounts. >>>>>>> >>>>>>> I have no idea what caused it, but yeah, we do have a pretty busy >>>>>>> site (apkmirror.com), and it caused a disruption for any uploads or >>>>>>> downloads from that server until I woke up and fixed the mount. >>>>>>> >>>>>>> I wish I could be more helpful but all I have is that stack trace. >>>>>>> >>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>> >>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>> atumball at redhat.com> wrote: >>>>>>> >>>>>>>> Hi Artem, >>>>>>>> >>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, as >>>>>>>> a clone of other bugs where recent discussions happened), and marked it as >>>>>>>> a blocker for glusterfs-5.4 release. >>>>>>>> >>>>>>>> We already have fixes for log flooding - >>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>> >>>>>>>> Can you please tell if the crashes happened as soon as upgrade ? or >>>>>>>> was there any particular pattern you observed before the crash. 
>>>>>>>> >>>>>>>> -Amar >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>> already got a crash which others have mentioned in >>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>>>>>>> unmount, kill gluster, and remount: >>>>>>>>> >>>>>>>>> >>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fcccafcd329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fcccafcd329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fcccafcd329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fcccafcd329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>> pending frames: >>>>>>>>> frame : type(1) op(READ) >>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>> frame : type(0) op(0) >>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>> signal received: 6 >>>>>>>>> time of crash: >>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>> configuration details: >>>>>>>>> argp 1 >>>>>>>>> backtrace 1 >>>>>>>>> dlfcn 1 >>>>>>>>> libpthread 1 >>>>>>>>> llistxattr 1 >>>>>>>>> setfsid 1 >>>>>>>>> spinlock 1 >>>>>>>>> epoll.h 1 >>>>>>>>> xattr.h 1 >>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>> 
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>> --------- >>>>>>>>> >>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>> not too sure how to make it core dump. >>>>>>>>> >>>>>>>>> If it's not fixed by the patches above, has anyone already opened >>>>>>>>> a ticket for the crashes that I can join and monitor? This is going to >>>>>>>>> create a massive problem for us since production systems are crashing. >>>>>>>>> >>>>>>>>> Thanks. >>>>>>>>> >>>>>>>>> Sincerely, >>>>>>>>> Artem >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>>>>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>>> commenting about this issue here >>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this. 
>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> ==> mnt-SITE_data1.log <=>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> ==> mnt-SITE_data3.log <=>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>> ==> mnt-SITE_data1.log <=>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>> ==> mnt-SITE_data3.log <=>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>> ==> mnt-SITE_data1.log <=>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>> handler >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I'm hoping raising the issue here on the mailing list may bring >>>>>>>>>>> some additional eyeballs and get them both fixed. >>>>>>>>>>> >>>>>>>>>>> Thanks. >>>>>>>>>>> >>>>>>>>>>> Sincerely, >>>>>>>>>>> Artem >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK >>>>>>>>>>> Mirror <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a >>>>>>>>>>>> comment from 3 days ago from someone else with 5.3 who started seeing the >>>>>>>>>>>> spam. 
>>>>>>>>>>>> >>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> +Milind Changire <mchangir at redhat.com> Can you check why this >>>>>>>>>> message is logged and send a fix? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>> >>>>>>>>>>>> Thanks. >>>>>>>>>>>> >>>>>>>>>>>> Sincerely, >>>>>>>>>>>> Artem >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK >>>>>>>>>>>> Mirror <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>> Gluster-users mailing list >>>>>>>>> Gluster-users at gluster.org >>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Amar Tumballi (amarts) >>>>>>>> >>>>>>> _______________________________________________ >>>>> Gluster-users mailing list >>>>> Gluster-users at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20190208/78ac2ce3/attachment.html>
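For the coredump and backtrace with symbols requested above, a minimal sketch on openSUSE Leap 15.0, assuming the debug symbols are published as a glusterfs-debuginfo package in the same repository and that core files land under /var/tmp (both are assumptions; adjust the core_pattern and package names to your setup, especially if systemd-coredump is in use):

    # allow the fuse client to dump core; run as root in the shell that
    # performs the remount so the limit is inherited by the client process
    ulimit -c unlimited
    echo '/var/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

    # install debug symbols so the backtrace resolves function names
    zypper install glusterfs-debuginfo

    # after the next crash, open the core against the same binary and capture the trace
    gdb /usr/sbin/glusterfs /var/tmp/core.glusterfs.<pid>
    (gdb) thread apply all bt full

The resulting output is the "bt with symbols"; attaching it (or the core file itself) to the bug report gives the developers the frame names that the package-string backtrace above only shows as raw offsets.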
Raghavendra Gowdappa
2019-Feb-08 03:18 UTC
[Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]
One possible reason could be https://review.gluster.org/r/18b6d7ce7d490e807815270918a17a4b392a829d as that changed some code in epoll handler. Though the change is largely on server side, the epoll and socket changes are relevant for client too. I'll try to see whether there is anything wrong with that. On Fri, Feb 8, 2019 at 8:36 AM Nithya Balachandran <nbalacha at redhat.com> wrote:> Thanks Artem. Can you send us the coredump or the bt with symbols from the > crash? > > Regards, > Nithya > > On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii <archon810 at gmail.com> > wrote: > >> Sorry to disappoint, but the crash just happened again, so lru-limit=0 >> didn't help. >> >> Here's the snippet of the crash and the subsequent remount by monit. >> >> >> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >> [0x7f4402b99329] >> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >> valid argument] >> The message "I [MSGID: 108031] >> [afr-common.c:2543:afr_local_discovery_cbk] 0-<SNIP>_data1-replicate-0: >> selecting local read_child <SNIP>_data1-client-3" repeated 39 times between >> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >> The message "E [MSGID: 101191] >> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >> [2019-02-08 01:13:09.311554] >> pending frames: >> frame : type(1) op(LOOKUP) >> frame : type(0) op(0) >> patchset: git://git.gluster.org/glusterfs.git >> signal received: 6 >> time of crash: >> 2019-02-08 01:13:09 >> configuration details: >> argp 1 >> backtrace 1 >> dlfcn 1 >> libpthread 1 >> llistxattr 1 >> setfsid 1 >> spinlock 1 >> epoll.h 1 >> xattr.h 1 >> st_atim.tv_nsec 1 >> package-string: glusterfs 5.3 >> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >> >> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >> >> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >> >> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >> --------- >> [2019-02-08 01:13:35.628478] I [MSGID: 100030] [glusterfsd.c:2715:main] >> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 >> (args: /usr/sbin/glusterfs --lru-limit=0 --process-name fuse >> --volfile-server=localhost --volfile-id=/<SNIP>_data1 /mnt/<SNIP>_data1) >> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 1 >> [2019-02-08 01:13:35.651405] I [MSGID: 
101190] >> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 2 >> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 3 >> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 4 >> [2019-02-08 01:13:35.652575] I [MSGID: 114020] [client.c:2354:notify] >> 0-<SNIP>_data1-client-0: parent translators are ready, attempting connect >> on transport >> [2019-02-08 01:13:35.652978] I [MSGID: 114020] [client.c:2354:notify] >> 0-<SNIP>_data1-client-1: parent translators are ready, attempting connect >> on transport >> [2019-02-08 01:13:35.655197] I [MSGID: 114020] [client.c:2354:notify] >> 0-<SNIP>_data1-client-2: parent translators are ready, attempting connect >> on transport >> [2019-02-08 01:13:35.655497] I [MSGID: 114020] [client.c:2354:notify] >> 0-<SNIP>_data1-client-3: parent translators are ready, attempting connect >> on transport >> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >> 0-<SNIP>_data1-client-0: changing port to 49153 (from 0) >> Final graph: >> >> >> Sincerely, >> Artem >> >> -- >> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >> <http://www.apkmirror.com/>, Illogical Robot LLC >> beerpla.net | +ArtemRussakovskii >> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >> <http://twitter.com/ArtemR> >> >> >> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii <archon810 at gmail.com> >> wrote: >> >>> I've added the lru-limit=0 parameter to the mounts, and I see it's taken >>> effect correctly: >>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>> --volfile-server=localhost --volfile-id=/<SNIP> /mnt/<SNIP>" >>> >>> Let's see if it stops crashing or not. >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>> <http://www.apkmirror.com/>, Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>> <http://twitter.com/ArtemR> >>> >>> >>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii <archon810 at gmail.com> >>> wrote: >>> >>>> Hi Nithya, >>>> >>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing >>>> crashes, and no further releases have been made yet. 
>>>> >>>> volume info: >>>> Type: Replicate >>>> Volume ID: ****SNIP**** >>>> Status: Started >>>> Snapshot Count: 0 >>>> Number of Bricks: 1 x 4 = 4 >>>> Transport-type: tcp >>>> Bricks: >>>> Brick1: ****SNIP**** >>>> Brick2: ****SNIP**** >>>> Brick3: ****SNIP**** >>>> Brick4: ****SNIP**** >>>> Options Reconfigured: >>>> cluster.quorum-count: 1 >>>> cluster.quorum-type: fixed >>>> network.ping-timeout: 5 >>>> network.remote-dio: enable >>>> performance.rda-cache-limit: 256MB >>>> performance.readdir-ahead: on >>>> performance.parallel-readdir: on >>>> network.inode-lru-limit: 500000 >>>> performance.md-cache-timeout: 600 >>>> performance.cache-invalidation: on >>>> performance.stat-prefetch: on >>>> features.cache-invalidation-timeout: 600 >>>> features.cache-invalidation: on >>>> cluster.readdir-optimize: on >>>> performance.io-thread-count: 32 >>>> server.event-threads: 4 >>>> client.event-threads: 4 >>>> performance.read-ahead: off >>>> cluster.lookup-optimize: on >>>> performance.cache-size: 1GB >>>> cluster.self-heal-daemon: enable >>>> transport.address-family: inet >>>> nfs.disable: on >>>> performance.client-io-threads: on >>>> cluster.granular-entry-heal: enable >>>> cluster.data-self-heal-algorithm: full >>>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>>> <http://www.apkmirror.com/>, Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>> <http://twitter.com/ArtemR> >>>> >>>> >>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>> nbalacha at redhat.com> wrote: >>>> >>>>> Hi Artem, >>>>> >>>>> Do you still see the crashes with 5.3? If yes, please try mount the >>>>> volume using the mount option lru-limit=0 and see if that helps. We are >>>>> looking into the crashes and will update when have a fix. >>>>> >>>>> Also, please provide the gluster volume info for the volume in >>>>> question. >>>>> >>>>> >>>>> regards, >>>>> Nithya >>>>> >>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii <archon810 at gmail.com> >>>>> wrote: >>>>> >>>>>> The fuse crash happened two more times, but this time monit helped >>>>>> recover within 1 minute, so it's a great workaround for now. >>>>>> >>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>> servers, and I don't know why. >>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>> <http://twitter.com/ArtemR> >>>>>> >>>>>> >>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>> archon810 at gmail.com> wrote: >>>>>> >>>>>>> The fuse crash happened again yesterday, to another volume. Are >>>>>>> there any mount options that could help mitigate this? >>>>>>> >>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) task >>>>>>> to watch and restart the mount, which works and recovers the mount point >>>>>>> within a minute. Not ideal, but a temporary workaround. >>>>>>> >>>>>>> By the way, the way to reproduce this "Transport endpoint is not >>>>>>> connected" condition for testing purposes is to kill -9 the right >>>>>>> "glusterfs --process-name fuse" process. 
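A minimal sketch of that reproduction, assuming one fuse mount per volume and the <SNIP> placeholders used in the logs above (find the client process for the mount you want to break, kill it, then let monit or a manual remount recover it):

    # list fuse client processes with their full command lines
    pgrep -af 'glusterfs.*--process-name fuse'

    # kill the one whose --volfile-id matches the target volume;
    # the mount then returns "Transport endpoint is not connected"
    kill -9 <pid>

    # recover by hand if monit is not watching this mount
    umount -l /mnt/<SNIP>_data1
    mount /mnt/<SNIP>_data1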
>>>>>>> >>>>>>> >>>>>>> monit check: >>>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>> >>>>>>> >>>>>>> stack trace: >>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>> [0x7fa0249e4329] >>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>> [0x7fa0249e4329] >>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>> The message "E [MSGID: 101191] >>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>> [2019-02-01 23:21:56.164427] >>>>>>> The message "I [MSGID: 108031] >>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>> pending frames: >>>>>>> frame : type(1) op(LOOKUP) >>>>>>> frame : type(0) op(0) >>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>> signal received: 6 >>>>>>> time of crash: >>>>>>> 2019-02-01 23:22:03 >>>>>>> configuration details: >>>>>>> argp 1 >>>>>>> backtrace 1 >>>>>>> dlfcn 1 >>>>>>> libpthread 1 >>>>>>> llistxattr 1 >>>>>>> setfsid 1 >>>>>>> spinlock 1 >>>>>>> epoll.h 1 >>>>>>> xattr.h 1 >>>>>>> st_atim.tv_nsec 1 >>>>>>> package-string: glusterfs 5.3 >>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>> beerpla.net | 
+ArtemRussakovskii >>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>> <http://twitter.com/ArtemR> >>>>>>> >>>>>>> >>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>> archon810 at gmail.com> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> The first (and so far only) crash happened at 2am the next day >>>>>>>> after we upgraded, on only one of four servers and only to one of two >>>>>>>> mounts. >>>>>>>> >>>>>>>> I have no idea what caused it, but yeah, we do have a pretty busy >>>>>>>> site (apkmirror.com), and it caused a disruption for any uploads >>>>>>>> or downloads from that server until I woke up and fixed the mount. >>>>>>>> >>>>>>>> I wish I could be more helpful but all I have is that stack trace. >>>>>>>> >>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>> >>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>> atumball at redhat.com> wrote: >>>>>>>> >>>>>>>>> Hi Artem, >>>>>>>>> >>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, >>>>>>>>> as a clone of other bugs where recent discussions happened), and marked it >>>>>>>>> as a blocker for glusterfs-5.4 release. >>>>>>>>> >>>>>>>>> We already have fixes for log flooding - >>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>> >>>>>>>>> Can you please tell if the crashes happened as soon as upgrade ? >>>>>>>>> or was there any particular pattern you observed before the crash. >>>>>>>>> >>>>>>>>> -Amar >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>>> already got a crash which others have mentioned in >>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>>>>>>>> unmount, kill gluster, and remount: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL 
[Invalid argument] >>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>> pending frames: >>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>> frame : type(0) op(0) >>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>> signal received: 6 >>>>>>>>>> time of crash: >>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>> configuration details: >>>>>>>>>> argp 1 >>>>>>>>>> backtrace 1 >>>>>>>>>> dlfcn 1 >>>>>>>>>> libpthread 1 >>>>>>>>>> llistxattr 1 >>>>>>>>>> setfsid 1 >>>>>>>>>> spinlock 1 >>>>>>>>>> epoll.h 1 >>>>>>>>>> xattr.h 1 >>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>> >>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>> --------- >>>>>>>>>> >>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>> >>>>>>>>>> If it's not fixed by the patches above, has anyone already opened >>>>>>>>>> a ticket for the crashes that I can join and monitor? This is going to >>>>>>>>>> create a massive problem for us since production systems are crashing. >>>>>>>>>> >>>>>>>>>> Thanks. 
>>>>>>>>>> >>>>>>>>>> Sincerely, >>>>>>>>>> Artem >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK >>>>>>>>>> Mirror <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>>>> commenting about this issue here >>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses >>>>>>>>>>> this. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> ==> mnt-SITE_data1.log <=>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>> ==> mnt-SITE_data3.log <=>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>> ==> mnt-SITE_data1.log <=>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>> ==> mnt-SITE_data3.log <=>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>> ==> mnt-SITE_data1.log <=>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>> handler >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I'm hoping raising the issue here on the mailing list may bring >>>>>>>>>>>> some additional eyeballs and get them both fixed. 
>>>>>>>>>>>> >>>>>>>>>>>> Thanks. >>>>>>>>>>>> >>>>>>>>>>>> Sincerely, >>>>>>>>>>>> Artem >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK >>>>>>>>>>>> Mirror <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's >>>>>>>>>>>>> a comment from 3 days ago from someone else with 5.3 who started seeing the >>>>>>>>>>>>> spam. >>>>>>>>>>>>> >>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> +Milind Changire <mchangir at redhat.com> Can you check why this >>>>>>>>>>> message is logged and send a fix? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks. >>>>>>>>>>>>> >>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>> Artem >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK >>>>>>>>>>>>> Mirror <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>> Gluster-users mailing list >>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Amar Tumballi (amarts) >>>>>>>>> >>>>>>>> _______________________________________________ >>>>>> Gluster-users mailing list >>>>>> Gluster-users at gluster.org >>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>> >>>>> _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20190208/a59ae805/attachment.html>
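For reference, the lru-limit workaround discussed in this thread (and later reported not to prevent the crash) is applied as a plain mount option; a sketch, assuming an fstab-managed mount and the <SNIP> placeholders used above:

    # /etc/fstab entry carrying the option
    localhost:/<SNIP>_data1  /mnt/<SNIP>_data1  glusterfs  defaults,_netdev,lru-limit=0  0 0

    # or remount by hand and confirm the flag reached the client process
    mount -t glusterfs -o lru-limit=0 localhost:/<SNIP>_data1 /mnt/<SNIP>_data1
    ps ax | grep '[g]lusterfs --lru-limit'

Once applied, the client command line should show --lru-limit=0, as in the remount logged near the top of this thread.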