Hi,
Just an update on this - we made our ACLs much, much stricter around
gluster ports and to my knowledge haven't seen a brick death since.
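
For anyone wanting to do something similar, the general shape is to allow
glusterd (24007) and the brick ports (49152 onwards) only from known
peers/clients and drop everything else. A rough iptables sketch - not our
exact rules; the 10.0.0.0/24 subnet is just an example and the brick port
range should be adjusted to match your own bricks:

# allow the gluster management and brick ports from trusted peers/clients only
iptables -A INPUT -p tcp -s 10.0.0.0/24 -m multiport --dports 24007,49152:49251 -j ACCEPT
# drop those same ports from everyone else
iptables -A INPUT -p tcp -m multiport --dports 24007,49152:49251 -j DROP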
Ben
On Wed, Dec 11, 2019 at 12:43 PM Ben Tasker <btasker at swiftserve.com>
wrote:
> Hi Xavi,
>
> We don't, that I'm explicitly aware of, *but* I can't rule it out as a
> possibility, as it's possible some of our partners do (some/most certainly
> have scans done as part of pentests fairly regularly).
>
> But, that does at least give me an avenue to pursue in the meantime,
> thanks!
>
> Ben
>
> On Wed, Dec 11, 2019 at 12:16 PM Xavi Hernandez <jahernan at redhat.com>
> wrote:
>
>> Hi Ben,
>>
>> I've recently seen some issues that seem similar to yours (based on the
>> stack trace in the logs). Right now it seems that in these cases the
>> problem is caused by some port scanning tool that triggers an unhandled
>> condition. We are still investigating the root cause so we can fix it as
>> soon as possible.
>>
>> Do you have one of these tools on your network?
>>
>> Regards,
>>
>> Xavi
>>
>> On Tue, Dec 10, 2019 at 7:53 PM Ben Tasker <btasker at swiftserve.com>
>> wrote:
>>
>>> Hi,
>>>
>>> A little while ago we had an issue with Gluster 6. As it was urgent, we
>>> downgraded to Gluster 5.9 and it went away.
>>>
>>> Some boxes are now running 5.10 and the issue has come back.
>>>
>>> From the operator's point of view, the first you know about this is
>>> getting reports that the transport endpoint is not connected:
>>>
>>> OSError: [Errno 107] Transport endpoint is not connected: '/shared/lfd/benfusetestlfd'
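>>>
>>> A quick way to spot this from the shell (a sketch, assuming /shared is the
>>> FUSE mountpoint - adjust the path to yours): stat fails against a dropped
>>> mount with the same error.
>>>
>>> # returns non-zero (and prints "Transport endpoint is not connected")
>>> # once the FUSE mount has dropped
>>> stat /shared >/dev/null 2>&1 || echo "gluster mount on /shared looks dead"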
>>>
>>>
>>> If we check, we can see that the brick process has died
>>>
>>> # gluster volume status
>>> Status of volume: shared
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick fa01.gl:/data1/gluster                N/A       N/A        N       N/A
>>> Brick fa02.gl:/data1/gluster                N/A       N/A        N       N/A
>>> Brick fa01.gl:/data2/gluster                49153     0          Y       14136
>>> Brick fa02.gl:/data2/gluster                49153     0          Y       14154
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>> Self-heal Daemon on localhost               N/A       N/A        Y       186193
>>> NFS Server on fa01.gl                       N/A       N/A        N       N/A
>>> Self-heal Daemon on fa01.gl                 N/A       N/A        Y       6723
>>>
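>>> (A one-liner that can flag this from cron - just a sketch, relying on the
>>> volume being called "shared" and the "Online" column being second-from-last
>>> in the status output:)
>>>
>>> # print any brick the CLI reports as offline (Online column == N)
>>> gluster volume status shared | awk '/^Brick/ && $(NF-1) == "N" {print "OFFLINE:", $2}'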
>>>
>>> Looking in the brick logs, we can see that the process crashed, and we
>>> get a backtrace (of sorts):
>>>
>>> >gen=110, slot->fd=17
>>> pending frames:
>>> patchset: git://git.gluster.org/glusterfs.git
>>> signal received: 11
>>> time of crash:
>>> 2019-07-04 09:42:43
>>> configuration details:
>>> argp 1
>>> backtrace 1
>>> dlfcn 1
>>> libpthread 1
>>> llistxattr 1
>>> setfsid 1
>>> spinlock 1
>>> epoll.h 1
>>> xattr.h 1
>>> st_atim.tv_nsec 1
>>> package-string: glusterfs 6.1
>>> /lib64/libglusterfs.so.0(+0x26db0)[0x7f79984eadb0]
>>> /lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f79984f57b4]
>>> /lib64/libc.so.6(+0x36280)[0x7f7996b2a280]
>>> /usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f798c8af4cc]
>>> /lib64/libglusterfs.so.0(+0x8c286)[0x7f7998550286]
>>> /lib64/libpthread.so.0(+0x7dd5)[0x7f799732add5]
>>> /lib64/libc.so.6(clone+0x6d)[0x7f7996bf1ead]
>>>
>>>
>>> Other than that, there's not a lot in the logs. In syslog we can see the
>>> client (Gluster's FS is mounted on the boxes) complaining that the brick's
>>> gone away.
>>>
>>> Software versions (for when this was happening with 6):
>>>
>>> # rpm -qa | grep glus
>>> glusterfs-libs-6.1-1.el7.x86_64
>>> glusterfs-cli-6.1-1.el7.x86_64
>>> centos-release-gluster6-1.0-1.el7.centos.noarch
>>> glusterfs-6.1-1.el7.x86_64
>>> glusterfs-api-6.1-1.el7.x86_64
>>> glusterfs-server-6.1-1.el7.x86_64
>>> glusterfs-client-xlators-6.1-1.el7.x86_64
>>> glusterfs-fuse-6.1-1.el7.x86_64
>>>
>>>
>>> This was happening pretty regularly (uncomfortably so) on boxes running
>>> Gluster 6. Grepping through the brick logs, it's always a segfault or
>>> SIGABRT that leads to the brick death:
>>>
>>> # grep "signal received:" data*
>>> data1-gluster.log:signal received: 11
>>> data1-gluster.log:signal received: 6
>>> data1-gluster.log:signal received: 6
>>> data1-gluster.log:signal received: 11
>>> data2-gluster.log:signal received: 6
>>>
>>> There's no apparent correlation in times or usage levels that we could
>>> see. The issue was occurring on a wide array of hardware, spread across the
>>> globe (but always talking to local - i.e. LAN - peers). All the same, disks
>>> were checked, RAM checked, etc.
>>>
>>> Digging through the logs, we were able to find the lines from just as the
>>> crash occurred:
>>>
>>> [2019-07-07 06:37:00.213490] I [MSGID: 108031] [afr-common.c:2547:afr_local_discovery_cbk] 0-shared-replicate-1: selecting local read_child shared-client-2
>>> [2019-07-07 06:37:03.544248] E [MSGID: 108008] [afr-transaction.c:2877:afr_write_txn_refresh_done] 0-shared-replicate-1: Failing SETATTR on gfid a9565e4b-9148-4969-91e8-ba816aea8f6a: split-brain observed. [Input/output error]
>>> [2019-07-07 06:37:03.544312] W [MSGID: 0] [dht-inode-write.c:1156:dht_non_mds_setattr_cbk] 0-shared-dht: subvolume shared-replicate-1 returned -1
>>> [2019-07-07 06:37:03.545317] E [MSGID: 108008] [afr-transaction.c:2877:afr_write_txn_refresh_done] 0-shared-replicate-1: Failing SETATTR on gfid a8dd2910-ff64-4ced-81ef-01852b7094ae: split-brain observed. [Input/output error]
>>> [2019-07-07 06:37:03.545382] W [fuse-bridge.c:1583:fuse_setattr_cbk] 0-glusterfs-fuse: 2241437: SETATTR() /lfd/benfusetestlfd/_logs => -1 (Input/output error)
>>>
>>> But it's not the first time that had occurred, so it may be completely
>>> unrelated.
>>>
>>> When this happens, restarting gluster buys some time. It may just be
>>> coincidental, but our searches through the logs showed *only* the first
>>> brick process dying; processes for other bricks (some of the boxes have 4)
>>> don't appear to be affected by this.
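>>>
>>> (For what it's worth, rather than bouncing the whole of gluster, "gluster
>>> volume start <vol> force" should restart only the bricks that aren't
>>> running and leave the healthy ones alone - sketch below, with the volume
>>> name assumed to be "shared":)
>>>
>>> # restart just the dead brick process(es) for the volume, then re-check
>>> gluster volume start shared force
>>> gluster volume status shared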
>>>
>>> As we had lots and lots of Gluster machines failing across the network,
>>> at this point we stopped investigating and I came up with a downgrade
>>> procedure so that we could get production back into a usable state.
>>> Machines running Gluster 6 were downgraded to Gluster 5.9 and the issue
>>> just went away. Unfortunately other demands came up, so no-one was able to
>>> follow up on it.
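>>>
>>> (In case it's useful to anyone, the rough outline of the downgrade - a
>>> sketch rather than our exact runbook, and it assumes the CentOS Storage SIG
>>> release packages are what's providing the gluster repos:)
>>>
>>> # on each node in turn: stop gluster, swap the SIG repo, downgrade, restart
>>> systemctl stop glusterd
>>> pkill glusterfsd                      # stop any brick processes still running
>>> yum -y remove centos-release-gluster6
>>> yum -y install centos-release-gluster5
>>> yum -y downgrade 'glusterfs*'
>>> systemctl start glusterd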
>>>
>>> Tonight, though, there's been a brick process failure on a 5.10 machine with
>>> an all-too-familiar-looking backtrace:
>>>
>>> [2019-12-10 17:20:01.708601] I [MSGID: 115029] [server-handshake.c:537:server_setvolume] 0-shared-server: accepted client from CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0 (version: 5.10)
>>> [2019-12-10 17:20:01.745940] I [MSGID: 115036] [server.c:469:server_rpc_notify] 0-shared-server: disconnecting connection from CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0
>>> [2019-12-10 17:20:01.746090] I [MSGID: 101055] [client_t.c:435:gf_client_unref] 0-shared-server: Shutting down connection CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0
>>> pending frames:
>>> patchset: git://git.gluster.org/glusterfs.git
>>> signal received: 11
>>> time of crash:
>>> 2019-12-10 17:21:36
>>> configuration details:
>>> argp 1
>>> backtrace 1
>>> dlfcn 1
>>> libpthread 1
>>> llistxattr 1
>>> setfsid 1
>>> spinlock 1
>>> epoll.h 1
>>> xattr.h 1
>>> st_atim.tv_nsec 1
>>> package-string: glusterfs 5.10
>>> /lib64/libglusterfs.so.0(+0x26650)[0x7f6a1c6f3650]
>>> /lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f6a1c6fdc04]
>>> /lib64/libc.so.6(+0x363b0)[0x7f6a1ad543b0]
>>> /usr/lib64/glusterfs/5.10/rpc-transport/socket.so(+0x9e3b)[0x7f6a112dae3b]
>>> /lib64/libglusterfs.so.0(+0x8aab9)[0x7f6a1c757ab9]
>>> /lib64/libpthread.so.0(+0x7e65)[0x7f6a1b556e65]
>>> /lib64/libc.so.6(clone+0x6d)[0x7f6a1ae1c88d]
>>> ---------
>>>
>>>
>>> Versions this time are:
>>>
>>> # rpm -qa | grep glus
>>> glusterfs-server-5.10-1.el7.x86_64
>>> centos-release-gluster5-1.0-1.el7.centos.noarch
>>> glusterfs-fuse-5.10-1.el7.x86_64
>>> glusterfs-libs-5.10-1.el7.x86_64
>>> glusterfs-client-xlators-5.10-1.el7.x86_64
>>> glusterfs-api-5.10-1.el7.x86_64
>>> glusterfs-5.10-1.el7.x86_64
>>> glusterfs-cli-5.10-1.el7.x86_64
>>>
>>>
>>> These boxes have been running 5.10 for less than 48 hours.
>>>
>>> Has anyone else run into this? Assuming the root cause is the same (it's a
>>> fairly limited backtrace, so hard to say for sure), was something from 6
>>> backported into 5.10?
>>>
>>> Thanks
>>>
>>> Ben
>>> ________
>>>
>>> Community Meeting Calendar:
>>>
>>> APAC Schedule -
>>> Every 2nd and 4th Tuesday at 11:30 AM IST
>>> Bridge: https://bluejeans.com/441850968
>>>
>>> NA/EMEA Schedule -
>>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>>> Bridge: https://bluejeans.com/441850968
>>>
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>