Artem Russakovskii
2019-Feb-12 04:53 UTC
[Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]
Great job identifying the issue! Any ETA on the next release with the logging and crash fixes in it? On Mon, Feb 11, 2019, 7:19 PM Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:> > > On Mon, Feb 11, 2019 at 3:49 PM Jo?o Ba?to < > joao.bauto at neuro.fchampalimaud.org> wrote: > >> Although I don't have these error messages, I'm having fuse crashes as >> frequent as you. I have disabled write-behind and the mount has been >> running over the weekend with heavy usage and no issues. >> > > The issue you are facing will likely be fixed by patch [1]. Me, Xavi and > Nithya were able to identify the corruption in write-behind. > > [1] https://review.gluster.org/22189 > > >> I can provide coredumps before disabling write-behind if needed. I opened >> a BZ report <https://bugzilla.redhat.com/show_bug.cgi?id=1671014> with >> the crashes that I was having. >> >> *Jo?o Ba?to* >> --------------- >> >> *Scientific Computing and Software Platform* >> Champalimaud Research >> Champalimaud Center for the Unknown >> Av. Bras?lia, Doca de Pedrou?os >> 1400-038 Lisbon, Portugal >> fchampalimaud.org <https://www.fchampalimaud.org/> >> >> >> Artem Russakovskii <archon810 at gmail.com> escreveu no dia s?bado, >> 9/02/2019 ?(s) 22:18: >> >>> Alright. I've enabled core-dumping (hopefully), so now I'm waiting for >>> the next crash to see if it dumps a core for you guys to remotely debug. >>> >>> Then I can consider setting performance.write-behind to off and >>> monitoring for further crashes. >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>> <http://www.apkmirror.com/>, Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>> <http://twitter.com/ArtemR> >>> >>> >>> On Fri, Feb 8, 2019 at 7:22 PM Raghavendra Gowdappa <rgowdapp at redhat.com> >>> wrote: >>> >>>> >>>> >>>> On Sat, Feb 9, 2019 at 12:53 AM Artem Russakovskii <archon810 at gmail.com> >>>> wrote: >>>> >>>>> Hi Nithya, >>>>> >>>>> I can try to disable write-behind as long as it doesn't heavily impact >>>>> performance for us. Which option is it exactly? I don't see it set in my >>>>> list of changed volume variables that I sent you guys earlier. >>>>> >>>> >>>> The option is performance.write-behind >>>> >>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>>>> <http://www.apkmirror.com/>, Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii >>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>> <http://twitter.com/ArtemR> >>>>> >>>>> >>>>> On Fri, Feb 8, 2019 at 4:57 AM Nithya Balachandran < >>>>> nbalacha at redhat.com> wrote: >>>>> >>>>>> Hi Artem, >>>>>> >>>>>> We have found the cause of one crash. Unfortunately we have not >>>>>> managed to reproduce the one you reported so we don't know if it is the >>>>>> same cause. >>>>>> >>>>>> Can you disable write-behind on the volume and let us know if it >>>>>> solves the problem? If yes, it is likely to be the same issue. >>>>>> >>>>>> >>>>>> regards, >>>>>> Nithya >>>>>> >>>>>> On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii <archon810 at gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Sorry to disappoint, but the crash just happened again, so >>>>>>> lru-limit=0 didn't help. >>>>>>> >>>>>>> Here's the snippet of the crash and the subsequent remount by monit. 
>>>>>>> >>>>>>> >>>>>>> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>> [0x7f4402b99329] >>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >>>>>>> valid argument] >>>>>>> The message "I [MSGID: 108031] >>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-<SNIP>_data1-replicate-0: >>>>>>> selecting local read_child <SNIP>_data1-client-3" repeated 39 times between >>>>>>> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >>>>>>> The message "E [MSGID: 101191] >>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >>>>>>> [2019-02-08 01:13:09.311554] >>>>>>> pending frames: >>>>>>> frame : type(1) op(LOOKUP) >>>>>>> frame : type(0) op(0) >>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>> signal received: 6 >>>>>>> time of crash: >>>>>>> 2019-02-08 01:13:09 >>>>>>> configuration details: >>>>>>> argp 1 >>>>>>> backtrace 1 >>>>>>> dlfcn 1 >>>>>>> libpthread 1 >>>>>>> llistxattr 1 >>>>>>> setfsid 1 >>>>>>> spinlock 1 >>>>>>> epoll.h 1 >>>>>>> xattr.h 1 >>>>>>> st_atim.tv_nsec 1 >>>>>>> package-string: glusterfs 5.3 >>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >>>>>>> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >>>>>>> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >>>>>>> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >>>>>>> --------- >>>>>>> [2019-02-08 01:13:35.628478] I [MSGID: 100030] >>>>>>> [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running >>>>>>> /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --lru-limit=0 >>>>>>> --process-name fuse --volfile-server=localhost --volfile-id=/<SNIP>_data1 >>>>>>> /mnt/<SNIP>_data1) >>>>>>> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>> with index 1 >>>>>>> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>> with index 2 >>>>>>> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>> with index 3 >>>>>>> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>> 
with index 4 >>>>>>> [2019-02-08 01:13:35.652575] I [MSGID: 114020] >>>>>>> [client.c:2354:notify] 0-<SNIP>_data1-client-0: parent translators are >>>>>>> ready, attempting connect on transport >>>>>>> [2019-02-08 01:13:35.652978] I [MSGID: 114020] >>>>>>> [client.c:2354:notify] 0-<SNIP>_data1-client-1: parent translators are >>>>>>> ready, attempting connect on transport >>>>>>> [2019-02-08 01:13:35.655197] I [MSGID: 114020] >>>>>>> [client.c:2354:notify] 0-<SNIP>_data1-client-2: parent translators are >>>>>>> ready, attempting connect on transport >>>>>>> [2019-02-08 01:13:35.655497] I [MSGID: 114020] >>>>>>> [client.c:2354:notify] 0-<SNIP>_data1-client-3: parent translators are >>>>>>> ready, attempting connect on transport >>>>>>> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >>>>>>> 0-<SNIP>_data1-client-0: changing port to 49153 (from 0) >>>>>>> Final graph: >>>>>>> >>>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>> <http://twitter.com/ArtemR> >>>>>>> >>>>>>> >>>>>>> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii < >>>>>>> archon810 at gmail.com> wrote: >>>>>>> >>>>>>>> I've added the lru-limit=0 parameter to the mounts, and I see it's >>>>>>>> taken effect correctly: >>>>>>>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>>>>>> --volfile-server=localhost --volfile-id=/<SNIP> /mnt/<SNIP>" >>>>>>>> >>>>>>>> Let's see if it stops crashing or not. >>>>>>>> >>>>>>>> Sincerely, >>>>>>>> Artem >>>>>>>> >>>>>>>> -- >>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>>>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>> <http://twitter.com/ArtemR> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> Hi Nithya, >>>>>>>>> >>>>>>>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started >>>>>>>>> seeing crashes, and no further releases have been made yet. 
>>>>>>>>> >>>>>>>>> volume info: >>>>>>>>> Type: Replicate >>>>>>>>> Volume ID: ****SNIP**** >>>>>>>>> Status: Started >>>>>>>>> Snapshot Count: 0 >>>>>>>>> Number of Bricks: 1 x 4 = 4 >>>>>>>>> Transport-type: tcp >>>>>>>>> Bricks: >>>>>>>>> Brick1: ****SNIP**** >>>>>>>>> Brick2: ****SNIP**** >>>>>>>>> Brick3: ****SNIP**** >>>>>>>>> Brick4: ****SNIP**** >>>>>>>>> Options Reconfigured: >>>>>>>>> cluster.quorum-count: 1 >>>>>>>>> cluster.quorum-type: fixed >>>>>>>>> network.ping-timeout: 5 >>>>>>>>> network.remote-dio: enable >>>>>>>>> performance.rda-cache-limit: 256MB >>>>>>>>> performance.readdir-ahead: on >>>>>>>>> performance.parallel-readdir: on >>>>>>>>> network.inode-lru-limit: 500000 >>>>>>>>> performance.md-cache-timeout: 600 >>>>>>>>> performance.cache-invalidation: on >>>>>>>>> performance.stat-prefetch: on >>>>>>>>> features.cache-invalidation-timeout: 600 >>>>>>>>> features.cache-invalidation: on >>>>>>>>> cluster.readdir-optimize: on >>>>>>>>> performance.io-thread-count: 32 >>>>>>>>> server.event-threads: 4 >>>>>>>>> client.event-threads: 4 >>>>>>>>> performance.read-ahead: off >>>>>>>>> cluster.lookup-optimize: on >>>>>>>>> performance.cache-size: 1GB >>>>>>>>> cluster.self-heal-daemon: enable >>>>>>>>> transport.address-family: inet >>>>>>>>> nfs.disable: on >>>>>>>>> performance.client-io-threads: on >>>>>>>>> cluster.granular-entry-heal: enable >>>>>>>>> cluster.data-self-heal-algorithm: full >>>>>>>>> >>>>>>>>> Sincerely, >>>>>>>>> Artem >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>>>>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>>>>>>> nbalacha at redhat.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Artem, >>>>>>>>>> >>>>>>>>>> Do you still see the crashes with 5.3? If yes, please try mount >>>>>>>>>> the volume using the mount option lru-limit=0 and see if that helps. We are >>>>>>>>>> looking into the crashes and will update when have a fix. >>>>>>>>>> >>>>>>>>>> Also, please provide the gluster volume info for the volume in >>>>>>>>>> question. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> regards, >>>>>>>>>> Nithya >>>>>>>>>> >>>>>>>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii < >>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> The fuse crash happened two more times, but this time monit >>>>>>>>>>> helped recover within 1 minute, so it's a great workaround for now. >>>>>>>>>>> >>>>>>>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>>>>>>> servers, and I don't know why. >>>>>>>>>>> >>>>>>>>>>> Sincerely, >>>>>>>>>>> Artem >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK >>>>>>>>>>> Mirror <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> The fuse crash happened again yesterday, to another volume. Are >>>>>>>>>>>> there any mount options that could help mitigate this? 
>>>>>>>>>>>> >>>>>>>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) >>>>>>>>>>>> task to watch and restart the mount, which works and recovers the mount >>>>>>>>>>>> point within a minute. Not ideal, but a temporary workaround. >>>>>>>>>>>> >>>>>>>>>>>> By the way, the way to reproduce this "Transport endpoint is >>>>>>>>>>>> not connected" condition for testing purposes is to kill -9 the right >>>>>>>>>>>> "glusterfs --process-name fuse" process. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> monit check: >>>>>>>>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>>>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> stack trace: >>>>>>>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>>>>>>> [2019-02-01 23:21:56.164427] >>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>>>>>>> pending frames: >>>>>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>> signal received: 6 >>>>>>>>>>>> time of crash: >>>>>>>>>>>> 2019-02-01 23:22:03 >>>>>>>>>>>> configuration details: >>>>>>>>>>>> argp 1 >>>>>>>>>>>> backtrace 1 >>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>> libpthread 1 >>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>> setfsid 1 >>>>>>>>>>>> spinlock 1 >>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>>>>>>> >>>>>>>>>>>> 
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>>>>>>> >>>>>>>>>>>> Sincerely, >>>>>>>>>>>> Artem >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK >>>>>>>>>>>> Mirror <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> The first (and so far only) crash happened at 2am the next day >>>>>>>>>>>>> after we upgraded, on only one of four servers and only to one of two >>>>>>>>>>>>> mounts. >>>>>>>>>>>>> >>>>>>>>>>>>> I have no idea what caused it, but yeah, we do have a pretty >>>>>>>>>>>>> busy site (apkmirror.com), and it caused a disruption for any >>>>>>>>>>>>> uploads or downloads from that server until I woke up and fixed the mount. >>>>>>>>>>>>> >>>>>>>>>>>>> I wish I could be more helpful but all I have is that stack >>>>>>>>>>>>> trace. >>>>>>>>>>>>> >>>>>>>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>>>>>>> atumball at redhat.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Artem, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 >>>>>>>>>>>>>> (ie, as a clone of other bugs where recent discussions happened), and >>>>>>>>>>>>>> marked it as a blocker for glusterfs-5.4 release. >>>>>>>>>>>>>> >>>>>>>>>>>>>> We already have fixes for log flooding - >>>>>>>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Can you please tell if the crashes happened as soon as >>>>>>>>>>>>>> upgrade ? or was there any particular pattern you observed before the crash. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> -Amar >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>>>>>>>> already got a crash which others have mentioned in >>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had >>>>>>>>>>>>>>> to unmount, kill gluster, and remount: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>>>>>>> pending frames: >>>>>>>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>>>> time of crash: >>>>>>>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>>>>>>> configuration details: >>>>>>>>>>>>>>> argp 1 >>>>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>>>> setfsid 1 >>>>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>>>>>>> --------- >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> If it's not fixed by the patches above, has anyone already >>>>>>>>>>>>>>> opened a ticket for the crashes that I can join and monitor? This is going >>>>>>>>>>>>>>> to create a massive problem for us since production systems are crashing. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK >>>>>>>>>>>>>>> Mirror <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>>>>>>>>> commenting about this issue here >>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ >>>>>>>>>>>>>>>> addresses this. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <=>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <=>>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <=>>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <=>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <=>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>> handler >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I'm hoping raising the issue here on the mailing list may >>>>>>>>>>>>>>>>> bring some additional eyeballs and get them both fixed. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK >>>>>>>>>>>>>>>>> Mirror <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>>>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. 
>>>>>>>>>>>>>>>>>> There's a comment from 3 days ago from someone else with 5.3 who started >>>>>>>>>>>>>>>>>> seeing the spam. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> +Milind Changire <mchangir at redhat.com> Can you check why >>>>>>>>>>>>>>>> this message is logged and send a fix? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK >>>>>>>>>>>>>>>>>> Mirror <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>>>>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Amar Tumballi (amarts) >>>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>> Gluster-users mailing list >>>>> Gluster-users at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20190211/dee7dea3/attachment.html>
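The two workarounds discussed in the quoted thread come down to one volume option and one mount option. A minimal sketch of both, assuming a volume named SITE_data1 mounted at /mnt/SITE_data1 as in the logs above (substitute your own volume name and mount point):

# Diagnostic step: turn off the write-behind translator implicated in the
# corruption, then confirm the setting took effect.
gluster volume set SITE_data1 performance.write-behind off
gluster volume get SITE_data1 performance.write-behind

# Earlier suggestion from the thread: remount the FUSE client with the
# inode LRU limit disabled via the lru-limit mount option.
umount /mnt/SITE_data1
mount -t glusterfs -o lru-limit=0 localhost:/SITE_data1 /mnt/SITE_data1

Disabling write-behind trades some write performance for stability, so it is only meant to confirm the diagnosis until the patched release ships.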
Raghavendra Gowdappa
2019-Feb-12 05:34 UTC
[Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]
On Tue, Feb 12, 2019 at 10:24 AM Artem Russakovskii <archon810 at gmail.com> wrote:

> Great job identifying the issue!
>
> Any ETA on the next release with the logging and crash fixes in it?

I've marked the write-behind corruption as a blocker for release-6. The logging fixes are already in the codebase.
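Since getting a usable core from the next crash comes up repeatedly in the quoted thread, here is a rough sketch of one way to make sure the FUSE client actually leaves one behind. The paths, and the assumption that resource limits propagate from the mounting shell to the glusterfs process, are illustrative rather than confirmed in the thread:

# Write core files to a predictable location.
sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p.%t

# Lift the core-size limit, then remount so the glusterfs client process
# inherits it from this shell.
ulimit -c unlimited
umount /mnt/SITE_data1 && mount /mnt/SITE_data1

# After the next abort (signal 6, as in the traces above), extract a
# backtrace for the bug report.
gdb /usr/sbin/glusterfs /var/tmp/core.glusterfs.<pid>.<timestamp> \
    -ex 'thread apply all bt full' -ex 'quit'

On distributions where systemd-coredump owns kernel.core_pattern, coredumpctl can be used to retrieve the dump instead of overriding the pattern.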
>>>>>>> >>>>>>> >>>>>>> regards, >>>>>>> Nithya >>>>>>> >>>>>>> On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii <archon810 at gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Sorry to disappoint, but the crash just happened again, so >>>>>>>> lru-limit=0 didn't help. >>>>>>>> >>>>>>>> Here's the snippet of the crash and the subsequent remount by monit. >>>>>>>> >>>>>>>> >>>>>>>> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>> [0x7f4402b99329] >>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >>>>>>>> valid argument] >>>>>>>> The message "I [MSGID: 108031] >>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-<SNIP>_data1-replicate-0: >>>>>>>> selecting local read_child <SNIP>_data1-client-3" repeated 39 times between >>>>>>>> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >>>>>>>> The message "E [MSGID: 101191] >>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >>>>>>>> [2019-02-08 01:13:09.311554] >>>>>>>> pending frames: >>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>> frame : type(0) op(0) >>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>> signal received: 6 >>>>>>>> time of crash: >>>>>>>> 2019-02-08 01:13:09 >>>>>>>> configuration details: >>>>>>>> argp 1 >>>>>>>> backtrace 1 >>>>>>>> dlfcn 1 >>>>>>>> libpthread 1 >>>>>>>> llistxattr 1 >>>>>>>> setfsid 1 >>>>>>>> spinlock 1 >>>>>>>> epoll.h 1 >>>>>>>> xattr.h 1 >>>>>>>> st_atim.tv_nsec 1 >>>>>>>> package-string: glusterfs 5.3 >>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >>>>>>>> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >>>>>>>> --------- >>>>>>>> [2019-02-08 01:13:35.628478] I [MSGID: 100030] >>>>>>>> [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running >>>>>>>> /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --lru-limit=0 >>>>>>>> --process-name fuse --volfile-server=localhost --volfile-id=/<SNIP>_data1 >>>>>>>> /mnt/<SNIP>_data1) >>>>>>>> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>> with index 1 >>>>>>>> [2019-02-08 01:13:35.651405] I 
[MSGID: 101190] >>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>> with index 2 >>>>>>>> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>> with index 3 >>>>>>>> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>> with index 4 >>>>>>>> [2019-02-08 01:13:35.652575] I [MSGID: 114020] >>>>>>>> [client.c:2354:notify] 0-<SNIP>_data1-client-0: parent translators are >>>>>>>> ready, attempting connect on transport >>>>>>>> [2019-02-08 01:13:35.652978] I [MSGID: 114020] >>>>>>>> [client.c:2354:notify] 0-<SNIP>_data1-client-1: parent translators are >>>>>>>> ready, attempting connect on transport >>>>>>>> [2019-02-08 01:13:35.655197] I [MSGID: 114020] >>>>>>>> [client.c:2354:notify] 0-<SNIP>_data1-client-2: parent translators are >>>>>>>> ready, attempting connect on transport >>>>>>>> [2019-02-08 01:13:35.655497] I [MSGID: 114020] >>>>>>>> [client.c:2354:notify] 0-<SNIP>_data1-client-3: parent translators are >>>>>>>> ready, attempting connect on transport >>>>>>>> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >>>>>>>> 0-<SNIP>_data1-client-0: changing port to 49153 (from 0) >>>>>>>> Final graph: >>>>>>>> >>>>>>>> >>>>>>>> Sincerely, >>>>>>>> Artem >>>>>>>> >>>>>>>> -- >>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>>>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>> <http://twitter.com/ArtemR> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> I've added the lru-limit=0 parameter to the mounts, and I see it's >>>>>>>>> taken effect correctly: >>>>>>>>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>>>>>>> --volfile-server=localhost --volfile-id=/<SNIP> /mnt/<SNIP>" >>>>>>>>> >>>>>>>>> Let's see if it stops crashing or not. >>>>>>>>> >>>>>>>>> Sincerely, >>>>>>>>> Artem >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror >>>>>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii < >>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Nithya, >>>>>>>>>> >>>>>>>>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started >>>>>>>>>> seeing crashes, and no further releases have been made yet. 
>>>>>>>>>> >>>>>>>>>> volume info: >>>>>>>>>> Type: Replicate >>>>>>>>>> Volume ID: ****SNIP**** >>>>>>>>>> Status: Started >>>>>>>>>> Snapshot Count: 0 >>>>>>>>>> Number of Bricks: 1 x 4 = 4 >>>>>>>>>> Transport-type: tcp >>>>>>>>>> Bricks: >>>>>>>>>> Brick1: ****SNIP**** >>>>>>>>>> Brick2: ****SNIP**** >>>>>>>>>> Brick3: ****SNIP**** >>>>>>>>>> Brick4: ****SNIP**** >>>>>>>>>> Options Reconfigured: >>>>>>>>>> cluster.quorum-count: 1 >>>>>>>>>> cluster.quorum-type: fixed >>>>>>>>>> network.ping-timeout: 5 >>>>>>>>>> network.remote-dio: enable >>>>>>>>>> performance.rda-cache-limit: 256MB >>>>>>>>>> performance.readdir-ahead: on >>>>>>>>>> performance.parallel-readdir: on >>>>>>>>>> network.inode-lru-limit: 500000 >>>>>>>>>> performance.md-cache-timeout: 600 >>>>>>>>>> performance.cache-invalidation: on >>>>>>>>>> performance.stat-prefetch: on >>>>>>>>>> features.cache-invalidation-timeout: 600 >>>>>>>>>> features.cache-invalidation: on >>>>>>>>>> cluster.readdir-optimize: on >>>>>>>>>> performance.io-thread-count: 32 >>>>>>>>>> server.event-threads: 4 >>>>>>>>>> client.event-threads: 4 >>>>>>>>>> performance.read-ahead: off >>>>>>>>>> cluster.lookup-optimize: on >>>>>>>>>> performance.cache-size: 1GB >>>>>>>>>> cluster.self-heal-daemon: enable >>>>>>>>>> transport.address-family: inet >>>>>>>>>> nfs.disable: on >>>>>>>>>> performance.client-io-threads: on >>>>>>>>>> cluster.granular-entry-heal: enable >>>>>>>>>> cluster.data-self-heal-algorithm: full >>>>>>>>>> >>>>>>>>>> Sincerely, >>>>>>>>>> Artem >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK >>>>>>>>>> Mirror <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>>>>>>>> nbalacha at redhat.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Artem, >>>>>>>>>>> >>>>>>>>>>> Do you still see the crashes with 5.3? If yes, please try mount >>>>>>>>>>> the volume using the mount option lru-limit=0 and see if that helps. We are >>>>>>>>>>> looking into the crashes and will update when have a fix. >>>>>>>>>>> >>>>>>>>>>> Also, please provide the gluster volume info for the volume in >>>>>>>>>>> question. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> regards, >>>>>>>>>>> Nithya >>>>>>>>>>> >>>>>>>>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> The fuse crash happened two more times, but this time monit >>>>>>>>>>>> helped recover within 1 minute, so it's a great workaround for now. >>>>>>>>>>>> >>>>>>>>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>>>>>>>> servers, and I don't know why. >>>>>>>>>>>> >>>>>>>>>>>> Sincerely, >>>>>>>>>>>> Artem >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK >>>>>>>>>>>> Mirror <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> The fuse crash happened again yesterday, to another volume. >>>>>>>>>>>>> Are there any mount options that could help mitigate this? 
>>>>>>>>>>>>> >>>>>>>>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) >>>>>>>>>>>>> task to watch and restart the mount, which works and recovers the mount >>>>>>>>>>>>> point within a minute. Not ideal, but a temporary workaround. >>>>>>>>>>>>> >>>>>>>>>>>>> By the way, the way to reproduce this "Transport endpoint is >>>>>>>>>>>>> not connected" condition for testing purposes is to kill -9 the right >>>>>>>>>>>>> "glusterfs --process-name fuse" process. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> monit check: >>>>>>>>>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>>>>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>>>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>>>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>>>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> stack trace: >>>>>>>>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>>>>>>>> [2019-02-01 23:21:56.164427] >>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>>>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>>>>>>>> pending frames: >>>>>>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>> time of crash: >>>>>>>>>>>>> 2019-02-01 23:22:03 >>>>>>>>>>>>> configuration details: >>>>>>>>>>>>> argp 1 >>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>> setfsid 1 >>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>>>>>>>> >>>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>>>>>>>> >>>>>>>>>>>>> 
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>>>>>>>> >>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>> Artem >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK >>>>>>>>>>>>> Mirror <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> The first (and so far only) crash happened at 2am the next >>>>>>>>>>>>>> day after we upgraded, on only one of four servers and only to one of two >>>>>>>>>>>>>> mounts. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I have no idea what caused it, but yeah, we do have a pretty >>>>>>>>>>>>>> busy site (apkmirror.com), and it caused a disruption for >>>>>>>>>>>>>> any uploads or downloads from that server until I woke up and fixed the >>>>>>>>>>>>>> mount. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I wish I could be more helpful but all I have is that stack >>>>>>>>>>>>>> trace. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>>>>>>>> atumball at redhat.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Artem, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 >>>>>>>>>>>>>>> (ie, as a clone of other bugs where recent discussions happened), and >>>>>>>>>>>>>>> marked it as a blocker for glusterfs-5.4 release. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We already have fixes for log flooding - >>>>>>>>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Can you please tell if the crashes happened as soon as >>>>>>>>>>>>>>> upgrade ? or was there any particular pattern you observed before the crash. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -Amar >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, >>>>>>>>>>>>>>>> I already got a crash which others have mentioned in >>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and >>>>>>>>>>>>>>>> had to unmount, kill gluster, and remount: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>>>>>>>> pending frames: >>>>>>>>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>>>>> time of crash: >>>>>>>>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>>>>>>>> configuration details: >>>>>>>>>>>>>>>> argp 1 >>>>>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>>>>> setfsid 1 >>>>>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>>>>> 
/usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>>>>>>>> --------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> If it's not fixed by the patches above, has anyone already >>>>>>>>>>>>>>>> opened a ticket for the crashes that I can join and monitor? This is going >>>>>>>>>>>>>>>> to create a massive problem for us since production systems are crashing. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK >>>>>>>>>>>>>>>> Mirror <http://www.apkmirror.com/>, Illogical Robot LLC >>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR >>>>>>>>>>>>>>>> <http://twitter.com/ArtemR> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Also, not sure if related or not, but I got a ton of >>>>>>>>>>>>>>>>>> these "Failed to dispatch handler" in my logs as well. Many people have >>>>>>>>>>>>>>>>>> been commenting about this issue here >>>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ >>>>>>>>>>>>>>>>> addresses this. 
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <==
>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]
>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <==
>>>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 413 times between [2019-01-30 20:36:23.881090] and [2019-01-30 20:38:20.015593]
>>>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-0" repeated 42 times between [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306]
>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <==
>>>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-0" repeated 50 times between [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789]
>>>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and [2019-01-30 20:38:20.546355]
>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-0
>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <==
>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-0
>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <==
>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm hoping raising the issue here on the mailing list may bring some additional eyeballs and get them both fixed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Sincerely,
>>>>>>>>>>>>>>>>>> Artem
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC
>>>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii <https://plus.google.com/+ArtemRussakovskii> | @ArtemR <http://twitter.com/ArtemR>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii <archon810 at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I found a similar issue here: https://bugzilla.redhat.com/show_bug.cgi?id=1313567.
>>>>>>>>>>>>>>>>>>> There's a comment from 3 days ago from someone else with 5.3 who started seeing the spam.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Here's the message that repeats over and over:
>>>>>>>>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +Milind Changire <mchangir at redhat.com> Can you check why this message is logged and send a fix?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Is there any fix for this issue?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Sincerely,
>>>>>>>>>>>>>>>>>>> Artem
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC
>>>>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii <https://plus.google.com/+ArtemRussakovskii> | @ArtemR <http://twitter.com/ArtemR>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Amar Tumballi (amarts)
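For readers landing on this thread and wondering, as asked above, how to make the fuse client produce a core dump: below is a minimal sketch of one common approach. The mount point, server name, and core directory are assumptions for illustration, not details taken from this thread.

# Sketch only: allow core dumps for a glusterfs fuse client started from
# this shell. Paths and names below are placeholders, not values from the
# thread.

# Allow unlimited core file size for processes launched from this shell.
ulimit -c unlimited

# Send cores to a predictable location (the path is an example).
mkdir -p /var/tmp/cores
sysctl -w kernel.core_pattern=/var/tmp/cores/core.%e.%p.%t

# Remount the volume so a fresh glusterfs client process inherits the limit.
# SERVER and SITE_data1 stand in for the real server and volume names.
umount /mnt/SITE_data1
mount -t glusterfs SERVER:/SITE_data1 /mnt/SITE_data1

# After the next crash, a core file should appear under /var/tmp/cores; it
# can then be inspected with gdb against the glusterfs binary plus the
# matching debuginfo packages.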