thr3ads.net - Gluster users - [Gluster-users] glusterfsd crashing [Mar 2017]


Vijay Bellur

2017-Mar-10 17:23 UTC

[Gluster-users] glusterfsd crashing

On Fri, Mar 10, 2017 at 11:17 AM, Sergei Gerasenko <gerases at gmail.com>
wrote:
> Hi,
>
> I'm running gluster 3.7.12. It's an 8-node distributed, replicated cluster
> (replica 2). It had been working fine for a long time when all of a
> sudden I started seeing bricks going offline. Researching further, I found
> messages like this:
>
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: pending frames:
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: frame : type(0) op(5)
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: patchset: git://git.gluster.com/glusterfs.git
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: signal received: 6
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: time of crash:
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: 2017-03-10 05:02:12
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: configuration details:
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: argp 1
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: backtrace 1
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: dlfcn 1
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: libpthread 1
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: llistxattr 1
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: setfsid 1
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: spinlock 1
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: epoll.h 1
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: xattr.h 1
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: st_atim.tv_nsec 1
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: package-string: glusterfs 3.7.12
> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: ---------
>
> I initially thought it was related to quota support (based on some
> googling), so I turned off quota and also disabled NFS support to simplify
> the debugging. After each crash, I restarted gluster and the
> bricks would come back online for several hours, only to crash again later.
> There are lots of messages like this preceding the crash:
>
> ...
> [2017-03-10 04:40:46.002225] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)
> [2017-03-10 04:40:46.002278] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]
> The message "E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)" repeated 3 times between [2017-03-10 04:40:46.002225] and [2017-03-10 04:40:46.005699]
> The message "E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]" repeated 3 times between [2017-03-10 04:40:46.002278] and [2017-03-10 04:40:46.005701]
> [2017-03-10 04:50:47.002170] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)
> [2017-03-10 04:50:47.002219] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]
> The message "E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)" repeated 3 times between [2017-03-10 04:50:47.002170] and [2017-03-10 04:50:47.005623]
> The message "E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]" repeated 3 times between [2017-03-10 04:50:47.002219] and [2017-03-10 04:50:47.005625]
> [2017-03-10 05:00:48.002246] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)
> [2017-03-10 05:00:48.002314] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]
> The message "E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)" repeated 3 times between [2017-03-10 05:00:48.002246] and [2017-03-10 05:00:48.005828]
> The message "E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]" repeated 3 times between [2017-03-10 05:00:48.002314] and [2017-03-10 05:00:48.005830]
>
> One important detail I noticed yesterday is that one of the nodes was
> running gluster version 3.7.13! I'm not sure what triggered the upgrade. So I
> downgraded to 3.7.12 and restarted gluster. The crash above happened
> several hours later. But again, the crashes had been happening before the
> downgrade -- possibly because of the version mismatch on one of the nodes.
>
> Anybody have any ideas?
>
>
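One way to check whether the lookup errors follow a pattern is to compare the timestamps of each burst. The sketch below is only an illustration: it takes the first MSGID 113091 line of each burst from the brick log quoted above and prints the gaps between them in seconds. The helper names (`burst_times`, `gaps_seconds`) are made up for this example and are not part of any gluster tooling.

```python
import re
from datetime import datetime

# First "null gfid" line of each burst, copied from the quoted brick log.
LOG = """\
[2017-03-10 04:40:46.002225] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)
[2017-03-10 04:50:47.002170] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)
[2017-03-10 05:00:48.002246] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)
"""

# Matches "[timestamp] E [MSGID: nnnnnn]" at the start of a brick log line.
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] E \[MSGID: (\d+)\]"
)


def burst_times(log_text, msgid):
    """Return the timestamps of all error lines carrying the given MSGID."""
    times = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(2) == msgid:
            times.append(
                datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            )
    return times


def gaps_seconds(times):
    """Return the gaps between consecutive timestamps, rounded to seconds."""
    return [round((b - a).total_seconds()) for a, b in zip(times, times[1:])]


print(gaps_seconds(burst_times(LOG, "113091")))  # -> [601, 601]
```

The bursts are about ten minutes apart, which suggests something periodic is issuing these lookups, although the log alone does not say what.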
Do you have the core files from the crashes? If so, can you please provide
a gdb backtrace from one of the core files?
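
For reference, a backtrace can be pulled from a core file non-interactively with gdb's batch mode. The following is a minimal sketch, not a prescribed procedure: the binary path `/usr/sbin/glusterfsd` and the core path are assumptions and need to be replaced with the real locations on the affected node, and the matching debuginfo packages should be installed for the output to be readable.

```python
import shlex
import subprocess


def gdb_backtrace_cmd(binary, core):
    """Build a gdb command line that dumps full backtraces of all threads."""
    return ["gdb", "--batch", "-ex", "thread apply all bt full", binary, core]


def collect_backtrace(binary, core):
    """Run gdb against the core file and return whatever it printed."""
    result = subprocess.run(
        gdb_backtrace_cmd(binary, core),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero; still return its output
    )
    return result.stdout


# Both paths below are placeholders (assumptions), not taken from the thread.
cmd = gdb_backtrace_cmd("/usr/sbin/glusterfsd", "/path/to/core")
print(shlex.join(cmd))
```

Calling `collect_backtrace()` with the real paths returns text that can be pasted into a reply.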

Thanks,
Vijay

Sergei Gerasenko

2017-Mar-10 17:50 UTC

[Gluster-users] glusterfsd crashing

I see why it's not saving the cores: the package isn't signed with the
right signature. I will modify the abrt configs to change that behavior and
wait for the next crash.
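
A rough sketch of how the relevant setting could be checked, assuming a stock ABRT install where the signature check is controlled by `OpenGPGCheck` in `/etc/abrt/abrt-action-save-package-data.conf`. Both the file path and the sample contents below are assumptions about a typical setup, not copied from these nodes, and the helper names are made up for illustration.

```python
# Illustrative contents only; the real file on a node may differ.
SAMPLE_CONF = """\
# Configuration file for abrt-action-save-package-data
OpenGPGCheck = yes
ProcessUnpackaged = no
"""


def parse_conf(text):
    """Parse simple 'key = value' lines, ignoring comments and blank lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def gpg_check_enabled(text):
    """Return True/False for OpenGPGCheck, or None if the key is absent."""
    value = parse_conf(text).get("OpenGPGCheck")
    if value is None:
        return None
    return value.lower() == "yes"


print(gpg_check_enabled(SAMPLE_CONF))  # -> True
```

If the check is enabled, crashes from packages that are not signed with a trusted key are discarded, which would match what I'm seeing here.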

