Hi,
Just an update on this - we made our ACLs much, much stricter around
gluster ports and to my knowledge haven't seen a brick death since.
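
For anyone wanting to do something similar, the general shape is to allow
glusterd (24007) and the brick ports (49152 onwards) only from known
peers/clients and drop everything else. A rough iptables sketch - not our
exact rules; the 10.0.0.0/24 subnet is just an example and the brick port
range should be adjusted to match your own bricks:

# allow the gluster management and brick ports from trusted peers/clients only
iptables -A INPUT -p tcp -s 10.0.0.0/24 -m multiport --dports 24007,49152:49251 -j ACCEPT
# drop those same ports from everyone else
iptables -A INPUT -p tcp -m multiport --dports 24007,49152:49251 -j DROP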
Ben
On Wed, Dec 11, 2019 at 12:43 PM Ben Tasker <btasker at swiftserve.com>
wrote:
> Hi Xavi,
>
> We don't, that I'm explicitly aware of, *but* I can't rule it out as a
> possibility, as it's possible some of our partners do (some/most certainly
> have scans done as part of pentests fairly regularly).
>
> But, that does at least give me an avenue to pursue in the meantime,
> thanks!
>
> Ben
>
> On Wed, Dec 11, 2019 at 12:16 PM Xavi Hernandez <jahernan at redhat.com>
> wrote:
>
>> Hi Ben,
>>
>> I've recently seen some issues that seem similar to yours (based on the
>> stack trace in the logs). Right now it seems that in these cases the
>> problem is caused by some port scanning tool that triggers an unhandled
>> condition. We are still investigating the root cause so we can fix it as
>> soon as possible.
>>
>> Do you have one of these tools on your network?
>>
>> Regards,
>>
>> Xavi
>>
>> On Tue, Dec 10, 2019 at 7:53 PM Ben Tasker <btasker at swiftserve.com>
>> wrote:
>>
>>> Hi,
>>>
>>> A little while ago we had an issue with Gluster 6. As it was urgent, we
>>> downgraded to Gluster 5.9 and it went away.
>>>
>>> Some boxes are now running 5.10 and the issue has come back.
>>>
>>> From the operator's point of view, the first you know about this is
>>> getting reports that the transport endpoint is not connected:
>>>
>>> OSError: [Errno 107] Transport endpoint is not connected: '/shared/lfd/benfusetestlfd'
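>>>
>>> A quick way to spot this from the shell (a sketch, assuming /shared is the
>>> FUSE mountpoint - adjust the path to yours): stat fails against a dropped
>>> mount with the same error.
>>>
>>> # returns non-zero (and prints "Transport endpoint is not connected")
>>> # once the FUSE mount has dropped
>>> stat /shared >/dev/null 2>&1 || echo "gluster mount on /shared looks dead"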
>>>
>>>
>>> If we check, we can see that the brick process has died
>>>
>>> # gluster volume status
>>> Status of volume: shared
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick fa01.gl:/data1/gluster                N/A       N/A        N       N/A
>>> Brick fa02.gl:/data1/gluster                N/A       N/A        N       N/A
>>> Brick fa01.gl:/data2/gluster                49153     0          Y       14136
>>> Brick fa02.gl:/data2/gluster                49153     0          Y       14154
>>> NFS Server on localhost                     N/A       N/A        N       N/A
>>> Self-heal Daemon on localhost               N/A       N/A        Y       186193
>>> NFS Server on fa01.gl                       N/A       N/A        N       N/A
>>> Self-heal Daemon on fa01.gl                 N/A       N/A        Y       6723
>>>
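>>> (A one-liner that can flag this from cron - just a sketch, relying on the
>>> volume being called "shared" and the "Online" column being second-from-last
>>> in the status output:)
>>>
>>> # print any brick the CLI reports as offline (Online column == N)
>>> gluster volume status shared | awk '/^Brick/ && $(NF-1) == "N" {print "OFFLINE:", $2}'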
>>>
>>> Looking in the brick logs, we can see that the process crashed, and we
>>> get a backtrace (of sorts):
>>>
>>> >gen=110, slot->fd=17
>>> pending frames:
>>> patchset: git://git.gluster.org/glusterfs.git
>>> signal received: 11
>>> time of crash:
>>> 2019-07-04 09:42:43
>>> configuration details:
>>> argp 1
>>> backtrace 1
>>> dlfcn 1
>>> libpthread 1
>>> llistxattr 1
>>> setfsid 1
>>> spinlock 1
>>> epoll.h 1
>>> xattr.h 1
>>> st_atim.tv_nsec 1
>>> package-string: glusterfs 6.1
>>> /lib64/libglusterfs.so.0(+0x26db0)[0x7f79984eadb0]
>>> /lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f79984f57b4]
>>> /lib64/libc.so.6(+0x36280)[0x7f7996b2a280]
>>> /usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f798c8af4cc]
>>> /lib64/libglusterfs.so.0(+0x8c286)[0x7f7998550286]
>>> /lib64/libpthread.so.0(+0x7dd5)[0x7f799732add5]
>>> /lib64/libc.so.6(clone+0x6d)[0x7f7996bf1ead]
>>>
>>>
>>> Other than that, there's not a lot in the logs. In syslog we can see the
>>> client (Gluster's FS is mounted on the boxes) complaining that the brick's
>>> gone away.
>>>
>>> Software versions (for when this was happening with 6):
>>>
>>> # rpm -qa | grep glus
>>> glusterfs-libs-6.1-1.el7.x86_64
>>> glusterfs-cli-6.1-1.el7.x86_64
>>> centos-release-gluster6-1.0-1.el7.centos.noarch
>>> glusterfs-6.1-1.el7.x86_64
>>> glusterfs-api-6.1-1.el7.x86_64
>>> glusterfs-server-6.1-1.el7.x86_64
>>> glusterfs-client-xlators-6.1-1.el7.x86_64
>>> glusterfs-fuse-6.1-1.el7.x86_64
>>>
>>>
>>> This was happening pretty regularly (uncomfortably so) on boxes running
>>> Gluster 6. Grepping through the brick logs, it's always a segfault or
>>> SIGABRT that leads to the brick death:
>>>
>>> # grep "signal received:" data*
>>> data1-gluster.log:signal received: 11
>>> data1-gluster.log:signal received: 6
>>> data1-gluster.log:signal received: 6
>>> data1-gluster.log:signal received: 11
>>> data2-gluster.log:signal received: 6
>>>
>>> There's no apparent correlation in times or usage levels that we could
>>> see. The issue was occurring on a wide array of hardware, spread across the
>>> globe (but always talking to local - i.e. LAN - peers). All the same, disks
>>> were checked, RAM checked, etc.
>>>
>>> Digging through the logs, we were able to find the lines from just as the
>>> crash occurred:
>>>
>>> [2019-07-07 06:37:00.213490] I [MSGID: 108031] [afr-common.c:2547:afr_local_discovery_cbk] 0-shared-replicate-1: selecting local read_child shared-client-2
>>> [2019-07-07 06:37:03.544248] E [MSGID: 108008] [afr-transaction.c:2877:afr_write_txn_refresh_done] 0-shared-replicate-1: Failing SETATTR on gfid a9565e4b-9148-4969-91e8-ba816aea8f6a: split-brain observed. [Input/output error]
>>> [2019-07-07 06:37:03.544312] W [MSGID: 0] [dht-inode-write.c:1156:dht_non_mds_setattr_cbk] 0-shared-dht: subvolume shared-replicate-1 returned -1
>>> [2019-07-07 06:37:03.545317] E [MSGID: 108008] [afr-transaction.c:2877:afr_write_txn_refresh_done] 0-shared-replicate-1: Failing SETATTR on gfid a8dd2910-ff64-4ced-81ef-01852b7094ae: split-brain observed. [Input/output error]
>>> [2019-07-07 06:37:03.545382] W [fuse-bridge.c:1583:fuse_setattr_cbk] 0-glusterfs-fuse: 2241437: SETATTR() /lfd/benfusetestlfd/_logs => -1 (Input/output error)
>>>
>>> But it's not the first time that had occurred, so it may be completely
>>> unrelated.
>>>
>>> When this happens, restarting gluster buys some time. It may just be
>>> coincidental, but our searches through the logs showed *only* the first
>>> brick process dying; processes for other bricks (some of the boxes have 4)
>>> don't appear to be affected by this.
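>>>
>>> (For what it's worth, rather than bouncing the whole of gluster, "gluster
>>> volume start <vol> force" should restart only the bricks that aren't
>>> running and leave the healthy ones alone - sketch below, with the volume
>>> name assumed to be "shared":)
>>>
>>> # restart just the dead brick process(es) for the volume, then re-check
>>> gluster volume start shared force
>>> gluster volume status shared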
>>>
>>> As we had lots and lots of Gluster machines failing across the network,
>>> at this point we stopped investigating and I came up with a downgrade
>>> procedure so that we could get production back into a usable state.
>>> Machines running Gluster 6 were downgraded to Gluster 5.9 and the issue
>>> just went away. Unfortunately other demands came up, so no-one was able to
>>> follow up on it.
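>>>
>>> (In case it's useful to anyone, the rough outline of the downgrade - a
>>> sketch rather than our exact runbook, and it assumes the CentOS Storage SIG
>>> release packages are what's providing the gluster repos:)
>>>
>>> # on each node in turn: stop gluster, swap the SIG repo, downgrade, restart
>>> systemctl stop glusterd
>>> pkill glusterfsd                      # stop any brick processes still running
>>> yum -y remove centos-release-gluster6
>>> yum -y install centos-release-gluster5
>>> yum -y downgrade 'glusterfs*'
>>> systemctl start glusterd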
>>>
>>> Tonight, though, there's been a brick process failure on a 5.10 machine with
>>> an all-too-familiar-looking backtrace:
>>>
>>> [2019-12-10 17:20:01.708601] I [MSGID: 115029] [server-handshake.c:537:server_setvolume] 0-shared-server: accepted client from CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0 (version: 5.10)
>>> [2019-12-10 17:20:01.745940] I [MSGID: 115036] [server.c:469:server_rpc_notify] 0-shared-server: disconnecting connection from CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0
>>> [2019-12-10 17:20:01.746090] I [MSGID: 101055] [client_t.c:435:gf_client_unref] 0-shared-server: Shutting down connection CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0
>>> pending frames:
>>> patchset: git://git.gluster.org/glusterfs.git
>>> signal received: 11
>>> time of crash:
>>> 2019-12-10 17:21:36
>>> configuration details:
>>> argp 1
>>> backtrace 1
>>> dlfcn 1
>>> libpthread 1
>>> llistxattr 1
>>> setfsid 1
>>> spinlock 1
>>> epoll.h 1
>>> xattr.h 1
>>> st_atim.tv_nsec 1
>>> package-string: glusterfs 5.10
>>> /lib64/libglusterfs.so.0(+0x26650)[0x7f6a1c6f3650]
>>> /lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f6a1c6fdc04]
>>> /lib64/libc.so.6(+0x363b0)[0x7f6a1ad543b0]
>>> /usr/lib64/glusterfs/5.10/rpc-transport/socket.so(+0x9e3b)[0x7f6a112dae3b]
>>> /lib64/libglusterfs.so.0(+0x8aab9)[0x7f6a1c757ab9]
>>> /lib64/libpthread.so.0(+0x7e65)[0x7f6a1b556e65]
>>> /lib64/libc.so.6(clone+0x6d)[0x7f6a1ae1c88d]
>>> ---------
>>>
>>>
>>> Versions this time are:
>>>
>>> # rpm -qa | grep glus
>>> glusterfs-server-5.10-1.el7.x86_64
>>> centos-release-gluster5-1.0-1.el7.centos.noarch
>>> glusterfs-fuse-5.10-1.el7.x86_64
>>> glusterfs-libs-5.10-1.el7.x86_64
>>> glusterfs-client-xlators-5.10-1.el7.x86_64
>>> glusterfs-api-5.10-1.el7.x86_64
>>> glusterfs-5.10-1.el7.x86_64
>>> glusterfs-cli-5.10-1.el7.x86_64
>>>
>>>
>>> These boxes have been running 5.10 for less than 48 hours.
>>>
>>> Has anyone else run into this? Assuming the root cause is the same (it's a
>>> fairly limited backtrace, so hard to say for sure), was something from 6
>>> backported into 5.10?
>>>
>>> Thanks
>>>
>>> Ben
>>> ________
>>>
>>> Community Meeting Calendar:
>>>
>>> APAC Schedule -
>>> Every 2nd and 4th Tuesday at 11:30 AM IST
>>> Bridge: https://bluejeans.com/441850968
>>>
>>> NA/EMEA Schedule -
>>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>>> Bridge: https://bluejeans.com/441850968
>>>
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>