thr3ads.net - Gluster users - [Gluster-users] [Gluster-devel] "rpc_clnt_ping_timer

If this information is useful, please help other people find it:
Share via:
Raghavendra Gowdappa
2019-Dec-01 07:21 UTC
[Gluster-users] [Gluster-devel] "rpc_clnt_ping_timer_expired" errors

adding gluster-users and glusterdevel as the discussion has some generic
points

+Gluster-users <gluster-users at gluster.org> +Gluster Devel
<gluster-devel at gluster.org>

On Mon, Mar 4, 2019 at 11:43 PM Raghavendra Gowdappa <rgowdapp at
redhat.com>
wrote:
>
>
> On Mon, Mar 4, 2019 at 11:26 PM Yaniv Kaul <ykaul at redhat.com>
wrote:
>
>> Is it that busy that it cannot reply for so many seconds to a simple
>> ping? That starved of CPU resources / threads, or are locks taken?
>>
>While I had answered giving some specific examples, my previous answer
missed some general observations:
ping timer is designed with following goals:

1. identifying ungraceful shutdown of bricks (hard power shutdown, pulling
of n/w cable etc)
2. identifying slow bricks and avoiding traffic to them (If connection is
shutdown, clients will not use it till reconnection happens)

When we consider point 2, ping packet is not a special packet. Instead it
experiences same delays as any other regular fs traffic. So, the question
asked above doesn't apply (as ping is not treated "specially" wrt
fs
traffic, if the brick is overloaded processing fs traffic, it is overloaded
to respond a "simple ping" too)

While the implementation/design neatly caters to the first goal, second
goal is not completely met and sort of partly implemented/designed. If goal
2 is to be completely fulfilled, I can think of at least 2 things to be
done currently:

1. reconnect logic in client should make intelligent decision whether to
reconnect to an overloaded brick and I guess identifying whether the brick
is no longer overloaded is not a trivial task
2. As a "fix" to ping-timer expiries, we introduced a "request
queue" to
glusterfs program. Event threads, which read requests from network will
only queue them to this "request queue" and processing of them is done
in
different threads. This (at least in theory) makes event-threads to read
and respond ping packets faster. However, this means ping responses are no
longer indicative of "load" on brick. With hindsight, to me this fix
is
more symptomatic (users no longer see ENOTCONN errors - which pushes the
responsibility out of module handling "ping" :), but they might
experience
slow performance) than address the underlying root cause of what made
bricks unresponsive at first. For eg., Manoj suggested suboptimal
event-threads might be one reason and in fact that was a correct
observation (at least for some setups).

> It looks to be the case. I remember following scenarios from various
> users/customers:
>
> * One customer implemented the suggestion of increasing the number of
> event-threads (that read msgs from network) and reported back it fixed
> ping-timer-expiry issues
> * Once Pranith and Krutika identified that an xattrop (basically a
> getxattr and setxattr done atomically holding a global inode lock) took
> more than 42 seconds (default ping timeout value) - identified through
> strace output - ping timer expiries were co-related at around the same
> time. In this case, event-threads were contending on the same lock and
> hence were blocked from reading further messages from network (that
> included ping requests). This issue (of global critical sections) are fixed
> currently.
> * Preventing event-threads from executing any code from glusterfs brick
> stack seem to have solved the issue. This code is present in master and is
> targeted for 3.4.4
> * I've heard reports saying when rebalance/quota heal is in progress
ping
> timer expiries are more likely to occur.
>
>> Y.
>>
>> On Mon, Mar 4, 2019, 7:47 PM Raghavendra Gowdappa <rgowdapp at
redhat.com>
>> wrote:
>>
>>>
>>> +Gluster Devel <gluster-devel at gluster.org>, +Gluster-users
>>> <gluster-users at gluster.org>
>>>
>>> I would like to point out another issue. Even if what I suggested
>>> prevents disconnects, part of the solution would be only
symptomatic
>>> treatment and doesn't address the root cause of the problem. In
most of the
>>> ping-timer-expiry issues, the root cause is the increased load on
bricks
>>> and the inability of bricks to be responsive under high load. So,
the
>>> actual solution would be doing any or both of the following:
>>> * identify the source of increased load and if possible throttle
it.
>>> Internal heal processes like self-heal, rebalance, quota heal are
known to
>>> pump traffic into bricks without much throttling (io-threads
_might_ do
>>> some throttling, but my understanding is its not sufficient).
>>> * identify the reason for bricks to become unresponsive during
load.
>>> This may be fixable issues like not enough event-threads to read
from
>>> network or difficult to fix issues like fsync on backend fs
freezing the
>>> process or semi fixable issues (in code) like lock contention.
>>>
>>> So any genuine effort to fix ping-timer-issues (to be honest most
of the
>>> times they are not issues related to rpc/network) would involve
performance
>>> characterization of various subsystems on bricks and clients.
Various
>>> subsystems can include (but not necessarily limited to), underlying
>>> OS/filesystem, glusterfs processes, CPU consumption etc
>>>
>>> regards,
>>> Raghavendra
>>>
>>> On Mon, Mar 4, 2019 at 9:31 PM Mauro Tridici <mauro.tridici at
cmcc.it>
>>> wrote:
>>>
>>>> Thank you, let?s try!
>>>> I will inform you about the effects of the change.
>>>>
>>>> Regards,
>>>> Mauro
>>>>
>>>> On 4 Mar 2019, at 16:55, Raghavendra Gowdappa <rgowdapp at
redhat.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>> On Mon, Mar 4, 2019 at 8:54 PM Mauro Tridici <mauro.tridici
at cmcc.it>
>>>> wrote:
>>>>
>>>>> Hi Raghavendra,
>>>>>
>>>>> thank you for your reply.
>>>>> Yes, you are right. It is a problem that seems to happen
randomly.
>>>>> At this moment, server.event-threads value is 4. I will try
to
>>>>> increase this value to 8. Do you think that it could be a
valid value ?
>>>>>
>>>>
>>>> Yes. We can try with that. You should see at least frequency of
>>>> ping-timer related disconnects  reduce with this value (even if
it doesn't
>>>> eliminate the problem completely).
>>>>
>>>>
>>>>> Regards,
>>>>> Mauro
>>>>>
>>>>>
>>>>> On 4 Mar 2019, at 15:36, Raghavendra Gowdappa <rgowdapp
at redhat.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 4, 2019 at 8:01 PM Nithya Balachandran <
>>>>> nbalacha at redhat.com> wrote:
>>>>>
>>>>>> Hi Mauro,
>>>>>>
>>>>>> It looks like some problem on s06. Are all your other
nodes ok? Can
>>>>>> you send us the gluster logs from this node?
>>>>>>
>>>>>> @Raghavendra G <raghavendra at gluster.com> , do
you have any idea as
>>>>>> to how this can be debugged? Maybe running top ? Or
debug brick logs?
>>>>>>
>>>>>
>>>>> If we can reproduce the problem, collecting tcpdump on both
ends of
>>>>> connection will help. But, one common problem is these bugs
are
>>>>> inconsistently reproducible and hence we may not be able to
capture tcpdump
>>>>> at correct intervals. Other than that, we can try to
collect some evidence
>>>>> that poller threads were busy (waiting on locks). But, not
sure what debug
>>>>> data provides that information.
>>>>>
>>>>> From what I know, its difficult to collect evidence for
this issue and
>>>>> we could only reason about it.
>>>>>
>>>>> We can try a workaround though - try increasing
server.event-threads
>>>>> and see whether ping-timer expiry issues go away with an
optimal value. If
>>>>> that's the case, it kind of provides proof for our
hypothesis.
>>>>>
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Nithya
>>>>>>
>>>>>> On Mon, 4 Mar 2019 at 15:25, Mauro Tridici
<mauro.tridici at cmcc.it>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> some minutes ago I received this message from
NAGIOS server
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ****** Nagios *****Notification Type:
PROBLEMService: Brick -
>>>>>>> /gluster/mnt2/brickHost: s06Address: s06-stgState:
CRITICALDate/Time: Mon
>>>>>>> Mar 4 10:25:33 CET 2019Additional Info:CHECK_NRPE
STATE CRITICAL: Socket
>>>>>>> timeout after 10 seconds.*
>>>>>>>
>>>>>>> I checked the network, RAM and CPUs usage on s06
node and everything
>>>>>>> seems to be ok.
>>>>>>> No bricks are in error state. In /var/log/messages,
I detected again
>>>>>>> a crash of ?check_vol_utili? that I think it is a
module used by NRPE
>>>>>>> executable (that is the NAGIOS client).
>>>>>>>
>>>>>>> Mar  4 10:15:29 s06 kernel: traps:
check_vol_utili[161224] general
>>>>>>> protection ip:7facffa0a66d sp:7ffe9f4e6fc0 error:0
in
>>>>>>> libglusterfs.so.0.0.1[7facff9b7000+f7000]
>>>>>>> Mar  4 10:15:29 s06 abrt-hook-ccpp: Process 161224
(python2.7) of
>>>>>>> user 0 killed by SIGSEGV - dumping core
>>>>>>> Mar  4 10:15:29 s06 abrt-server: Generating
core_backtrace
>>>>>>> Mar  4 10:15:29 s06 abrt-server: Error: Unable to
open './coredump':
>>>>>>> No such file or directory
>>>>>>> Mar  4 10:16:01 s06 systemd: Created slice User
Slice of root.
>>>>>>> Mar  4 10:16:01 s06 systemd: Starting User Slice of
root.
>>>>>>> Mar  4 10:16:01 s06 systemd: Started Session 201010
of user root.
>>>>>>> Mar  4 10:16:01 s06 systemd: Starting Session
201010 of user root.
>>>>>>> Mar  4 10:16:01 s06 systemd: Removed slice User
Slice of root.
>>>>>>> Mar  4 10:16:01 s06 systemd: Stopping User Slice of
root.
>>>>>>> Mar  4 10:16:24 s06 abrt-server: Duplicate: UUID
>>>>>>> Mar  4 10:16:24 s06 abrt-server: DUP_OF_DIR:
>>>>>>> /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
>>>>>>> Mar  4 10:16:24 s06 abrt-server: Deleting problem
directory
>>>>>>> ccpp-2019-03-04-10:15:29-161224 (dup of
ccpp-2018-09-25-12:27:42-13041)
>>>>>>> Mar  4 10:16:24 s06 abrt-server: Generating
core_backtrace
>>>>>>> Mar  4 10:16:24 s06 abrt-server: Error: Unable to
open './coredump':
>>>>>>> No such file or directory
>>>>>>> Mar  4 10:16:24 s06 abrt-server: Cannot notify
>>>>>>>
'/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event
>>>>>>> 'report_uReport' exited with 1
>>>>>>> Mar  4 10:16:24 s06 abrt-hook-ccpp: Process 161391
(python2.7) of
>>>>>>> user 0 killed by SIGABRT - dumping core
>>>>>>> Mar  4 10:16:25 s06 abrt-server: Generating
core_backtrace
>>>>>>> Mar  4 10:16:25 s06 abrt-server: Error: Unable to
open './coredump':
>>>>>>> No such file or directory
>>>>>>> Mar  4 10:17:01 s06 systemd: Created slice User
Slice of root.
>>>>>>>
>>>>>>> Also, I noticed the following errors that I think
are very critical:
>>>>>>>
>>>>>>> Mar  4 10:21:12 s06 glustershd[20355]: [2019-03-04
09:21:12.954798]
>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-55:
>>>>>>> server 192.168.0.55:49158 has not responded in the
last 42 seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  4 10:22:01 s06 systemd: Created slice User
Slice of root.
>>>>>>> Mar  4 10:22:01 s06 systemd: Starting User Slice of
root.
>>>>>>> Mar  4 10:22:01 s06 systemd: Started Session 201017
of user root.
>>>>>>> Mar  4 10:22:01 s06 systemd: Starting Session
201017 of user root.
>>>>>>> Mar  4 10:22:01 s06 systemd: Removed slice User
Slice of root.
>>>>>>> Mar  4 10:22:01 s06 systemd: Stopping User Slice of
root.
>>>>>>> Mar  4 10:22:03 s06 glustershd[20355]: [2019-03-04
09:22:03.964120]
>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-60:
>>>>>>> server 192.168.0.54:49165 has not responded in the
last 42 seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  4 10:23:01 s06 systemd: Created slice User
Slice of root.
>>>>>>> Mar  4 10:23:01 s06 systemd: Starting User Slice of
root.
>>>>>>> Mar  4 10:23:01 s06 systemd: Started Session 201018
of user root.
>>>>>>> Mar  4 10:23:01 s06 systemd: Starting Session
201018 of user root.
>>>>>>> Mar  4 10:23:02 s06 systemd: Removed slice User
Slice of root.
>>>>>>> Mar  4 10:23:02 s06 systemd: Stopping User Slice of
root.
>>>>>>> Mar  4 10:24:01 s06 systemd: Created slice User
Slice of root.
>>>>>>> Mar  4 10:24:01 s06 systemd: Starting User Slice of
root.
>>>>>>> Mar  4 10:24:01 s06 systemd: Started Session 201019
of user root.
>>>>>>> Mar  4 10:24:01 s06 systemd: Starting Session
201019 of user root.
>>>>>>> Mar  4 10:24:01 s06 systemd: Removed slice User
Slice of root.
>>>>>>> Mar  4 10:24:01 s06 systemd: Stopping User Slice of
root.
>>>>>>> Mar  4 10:24:03 s06 glustershd[20355]: [2019-03-04
09:24:03.982502]
>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-16:
>>>>>>> server 192.168.0.52:49158 has not responded in the
last 42 seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04
09:24:05.746109] C
>>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-3: server
>>>>>>> 192.168.0.51:49153 has not responded in the last 42
seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04
09:24:05.746215] C
>>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-10: server
>>>>>>> 192.168.0.52:49156 has not responded in the last 42
seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04
09:24:05.746260] C
>>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-21: server
>>>>>>> 192.168.0.51:49159 has not responded in the last 42
seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04
09:24:05.746296] C
>>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-25: server
>>>>>>> 192.168.0.52:49161 has not responded in the last 42
seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04
09:24:05.746413] C
>>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-60: server
>>>>>>> 192.168.0.54:49165 has not responded in the last 42
seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  4 10:24:07 s06 glustershd[20355]: [2019-03-04
09:24:07.982952]
>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-45:
>>>>>>> server 192.168.0.54:49155 has not responded in the
last 42 seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  4 10:24:18 s06 glustershd[20355]: [2019-03-04
09:24:18.990929]
>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-25:
>>>>>>> server 192.168.0.52:49161 has not responded in the
last 42 seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  4 10:24:31 s06 glustershd[20355]: [2019-03-04
09:24:31.995781]
>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-20:
>>>>>>> server 192.168.0.53:49159 has not responded in the
last 42 seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  4 10:25:01 s06 systemd: Created slice User
Slice of root.
>>>>>>> Mar  4 10:25:01 s06 systemd: Starting User Slice of
root.
>>>>>>> Mar  4 10:25:01 s06 systemd: Started Session 201020
of user root.
>>>>>>> Mar  4 10:25:01 s06 systemd: Starting Session
201020 of user root.
>>>>>>> Mar  4 10:25:01 s06 systemd: Removed slice User
Slice of root.
>>>>>>> Mar  4 10:25:01 s06 systemd: Stopping User Slice of
root.
>>>>>>> Mar  4 10:25:57 s06 systemd: Created slice User
Slice of root.
>>>>>>> Mar  4 10:25:57 s06 systemd: Starting User Slice of
root.
>>>>>>> Mar  4 10:25:57 s06 systemd-logind: New session
201021 of user root.
>>>>>>> Mar  4 10:25:57 s06 systemd: Started Session 201021
of user root.
>>>>>>> Mar  4 10:25:57 s06 systemd: Starting Session
201021 of user root.
>>>>>>> Mar  4 10:26:01 s06 systemd: Started Session 201022
of user root.
>>>>>>> Mar  4 10:26:01 s06 systemd: Starting Session
201022 of user root.
>>>>>>> Mar  4 10:26:21 s06 nrpe[162388]: Error: Could not
complete SSL
>>>>>>> handshake with 192.168.1.56: 5
>>>>>>> Mar  4 10:27:01 s06 systemd: Started Session 201023
of user root.
>>>>>>> Mar  4 10:27:01 s06 systemd: Starting Session
201023 of user root.
>>>>>>> Mar  4 10:28:01 s06 systemd: Started Session 201024
of user root.
>>>>>>> Mar  4 10:28:01 s06 systemd: Starting Session
201024 of user root.
>>>>>>> Mar  4 10:29:01 s06 systemd: Started Session 201025
of user root.
>>>>>>> Mar  4 10:29:01 s06 systemd: Starting Session
201025 of user root.
>>>>>>>
>>>>>>> But, unfortunately, I don?t understand why it is
happening.
>>>>>>> Now, NAGIOS server shows that s06 status is ok:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ****** Nagios *****Notification Type:
RECOVERYService: Brick -
>>>>>>> /gluster/mnt2/brickHost: s06Address: s06-stgState:
OKDate/Time: Mon Mar 4
>>>>>>> 10:35:23 CET 2019Additional Info:OK: Brick
/gluster/mnt2/brick is up*
>>>>>>>
>>>>>>> Nothing is changed from RAM, CPUs, and NETWORK
point of view.
>>>>>>> /var/log/message file has been updated:
>>>>>>>
>>>>>>> Mar  4 10:32:01 s06 systemd: Starting Session
201029 of user root.
>>>>>>> Mar  4 10:32:30 s06 glustershd[20355]: [2019-03-04
09:32:30.069082]
>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-10:
>>>>>>> server 192.168.0.52:49156 has not responded in the
last 42 seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  4 10:32:55 s06 glustershd[20355]: [2019-03-04
09:32:55.074689]
>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-66:
>>>>>>> server 192.168.0.54:49167 has not responded in the
last 42 seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  4 10:33:01 s06 systemd: Started Session 201030
of user root.
>>>>>>> Mar  4 10:33:01 s06 systemd: Starting Session
201030 of user root.
>>>>>>> Mar  4 10:34:01 s06 systemd: Started Session 201031
of user root.
>>>>>>> Mar  4 10:34:01 s06 systemd: Starting Session
201031 of user root.
>>>>>>> Mar  4 10:35:01 s06 nrpe[162562]: Could not read
request from client
>>>>>>> 192.168.1.56, bailing out...
>>>>>>> Mar  4 10:35:01 s06 nrpe[162562]: INFO: SSL Socket
Shutdown.
>>>>>>> Mar  4 10:35:01 s06 systemd: Started Session 201032
of user root.
>>>>>>> Mar  4 10:35:01 s06 systemd: Starting Session
201032 of user root.
>>>>>>>
>>>>>>> Could you please help me to understand what it?s
happening ?
>>>>>>> Thank you in advance.
>>>>>>>
>>>>>>> Rergards,
>>>>>>> Mauro
>>>>>>>
>>>>>>>
>>>>>>> On 1 Mar 2019, at 12:17, Mauro Tridici
<mauro.tridici at cmcc.it>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Thank you, Milind.
>>>>>>> I executed the instructions you suggested:
>>>>>>>
>>>>>>> - grep ?blocked for? /var/log/messages on s06
returns no output (no
>>>>>>> ?blocked? word is detected in messages file);
>>>>>>> - in /var/log/messages file I can see this kind of
error repeated
>>>>>>> for a lot of times:
>>>>>>>
>>>>>>> Mar  1 08:43:01 s06 systemd: Starting Session
196071 of user root.
>>>>>>> Mar  1 08:43:01 s06 systemd: Removed slice User
Slice of root.
>>>>>>> Mar  1 08:43:01 s06 systemd: Stopping User Slice of
root.
>>>>>>> Mar  1 08:43:02 s06 kernel: traps:
check_vol_utili[57091] general
>>>>>>> protection ip:7f88e76ee66d sp:7ffe5a5bcc30 error:0
in
>>>>>>> libglusterfs.so.0.0.1[7f88e769b000+f7000]
>>>>>>> Mar  1 08:43:02 s06 abrt-hook-ccpp: Process 57091
(python2.7) of
>>>>>>> user 0 killed by SIGSEGV - dumping core
>>>>>>> Mar  1 08:43:02 s06 abrt-server: Generating
core_backtrace
>>>>>>> Mar  1 08:43:02 s06 abrt-server: Error: Unable to
open './coredump':
>>>>>>> No such file or directory
>>>>>>> Mar  1 08:43:58 s06 abrt-server: Duplicate: UUID
>>>>>>> Mar  1 08:43:58 s06 abrt-server: DUP_OF_DIR:
>>>>>>> /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
>>>>>>> Mar  1 08:43:58 s06 abrt-server: Deleting problem
directory
>>>>>>> ccpp-2019-03-01-08:43:02-57091 (dup of
ccpp-2018-09-25-12:27:42-13041)
>>>>>>> Mar  1 08:43:58 s06 dbus[1872]: [system] Activating
service
>>>>>>> name='org.freedesktop.problems' (using
servicehelper)
>>>>>>> Mar  1 08:43:58 s06 dbus[1872]: [system]
Successfully activated
>>>>>>> service 'org.freedesktop.problems'
>>>>>>> Mar  1 08:43:58 s06 abrt-server: Generating
core_backtrace
>>>>>>> Mar  1 08:43:58 s06 abrt-server: Error: Unable to
open './coredump':
>>>>>>> No such file or directory
>>>>>>> Mar  1 08:43:58 s06 abrt-server: Cannot notify
>>>>>>>
'/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event
>>>>>>> 'report_uReport' exited with 1
>>>>>>> Mar  1 08:44:01 s06 systemd: Created slice User
Slice of root.
>>>>>>> Mar  1 08:44:01 s06 systemd: Starting User Slice of
root.
>>>>>>> Mar  1 08:44:01 s06 systemd: Started Session 196072
of user root.
>>>>>>> Mar  1 08:44:01 s06 systemd: Starting Session
196072 of user root.
>>>>>>> Mar  1 08:44:01 s06 systemd: Removed slice User
Slice of root.
>>>>>>>
>>>>>>> - in /var/log/messages file I can see also 4 errors
related to other
>>>>>>> cluster servers:
>>>>>>>
>>>>>>> Mar  1 11:05:01 s06 systemd: Starting User Slice of
root.
>>>>>>> Mar  1 11:05:01 s06 systemd: Started Session 196230
of user root.
>>>>>>> Mar  1 11:05:01 s06 systemd: Starting Session
196230 of user root.
>>>>>>> Mar  1 11:05:01 s06 systemd: Removed slice User
Slice of root.
>>>>>>> Mar  1 11:05:01 s06 systemd: Stopping User Slice of
root.
>>>>>>> Mar  1 11:05:59 s06 glustershd[70117]: [2019-03-01
10:05:59.347094]
>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-33:
>>>>>>> server 192.168.0.51:49163 has not responded in the
last 42 seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  1 11:06:01 s06 systemd: Created slice User
Slice of root.
>>>>>>> Mar  1 11:06:01 s06 systemd: Starting User Slice of
root.
>>>>>>> Mar  1 11:06:01 s06 systemd: Started Session 196231
of user root.
>>>>>>> Mar  1 11:06:01 s06 systemd: Starting Session
196231 of user root.
>>>>>>> Mar  1 11:06:01 s06 systemd: Removed slice User
Slice of root.
>>>>>>> Mar  1 11:06:01 s06 systemd: Stopping User Slice of
root.
>>>>>>> Mar  1 11:06:12 s06 glustershd[70117]: [2019-03-01
10:06:12.351319]
>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-1:
>>>>>>> server 192.168.0.52:49153 has not responded in the
last 42 seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  1 11:06:38 s06 glustershd[70117]: [2019-03-01
10:06:38.356920]
>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-7:
>>>>>>> server 192.168.0.52:49155 has not responded in the
last 42 seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  1 11:07:01 s06 systemd: Created slice User
Slice of root.
>>>>>>> Mar  1 11:07:01 s06 systemd: Starting User Slice of
root.
>>>>>>> Mar  1 11:07:01 s06 systemd: Started Session 196232
of user root.
>>>>>>> Mar  1 11:07:01 s06 systemd: Starting Session
196232 of user root.
>>>>>>> Mar  1 11:07:01 s06 systemd: Removed slice User
Slice of root.
>>>>>>> Mar  1 11:07:01 s06 systemd: Stopping User Slice of
root.
>>>>>>> Mar  1 11:07:36 s06 glustershd[70117]: [2019-03-01
10:07:36.366259]
>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired]
0-tier2-client-0:
>>>>>>> server 192.168.0.51:49152 has not responded in the
last 42 seconds,
>>>>>>> disconnecting.
>>>>>>> Mar  1 11:08:01 s06 systemd: Created slice User
Slice of root.
>>>>>>>
>>>>>>> No ?blocked? word is in /var/log/messages files on
other cluster
>>>>>>> servers.
>>>>>>> In attachment, the /var/log/messages file from s06
server.
>>>>>>>
>>>>>>> Thank you in advance,
>>>>>>> Mauro
>>>>>>>
>>>>>>> <messages.zip>
>>>>>>>
>>>>>>>
>>>>>>> On 1 Mar 2019, at 11:47, Milind Changire
<mchangir at redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> The traces of very high disk activity on the
servers are often found
>>>>>>> in /var/log/messages
>>>>>>> You might want to grep for "blocked for"
in /var/log/messages on s06
>>>>>>> and correlate the timestamps to confirm the
unresponsiveness as reported in
>>>>>>> gluster client logs.
>>>>>>> In cases of high disk activity, although the
operating system
>>>>>>> continues to respond to ICMP pings, the processes
writing to disks often
>>>>>>> get blocked to a large flush to the disk which
could span beyond 42 seconds
>>>>>>> and hence result in ping-timer-expiry logs.
>>>>>>>
>>>>>>> As a side note:
>>>>>>> If you indeed find gluster processes being blocked
in
>>>>>>> /var/log/messages, you might want to tweak sysctl
tunables called
>>>>>>> vm.dirty_background_ratio or
vm.dirty_background_bytes to a smaller value
>>>>>>> than the existing. Please read up more on those
tunables before touching
>>>>>>> the settings.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 1, 2019 at 4:06 PM Mauro Tridici
<mauro.tridici at cmcc.it>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> in attachment the client log captured after
changing
>>>>>>>> network.ping-timeout option.
>>>>>>>> I noticed this error involving server
192.168.0.56 (s06)
>>>>>>>>
>>>>>>>> [2019-03-01 09:23:36.077287] I
[rpc-clnt.c:1962:rpc_clnt_reconfig]
>>>>>>>> 0-tier2-client-71: changing ping timeout to 42
(from 0)
>>>>>>>> [2019-03-01 09:23:36.078213] I
>>>>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk]
0-glusterfs: No change in
>>>>>>>> volfile,continuing
>>>>>>>> [2019-03-01 09:23:36.078432] I
>>>>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk]
0-glusterfs: No change in
>>>>>>>> volfile,continuing
>>>>>>>> [2019-03-01 09:23:36.092357] I
>>>>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk]
0-glusterfs: No change in
>>>>>>>> volfile,continuing
>>>>>>>> [2019-03-01 09:23:36.094146] I
>>>>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk]
0-glusterfs: No change in
>>>>>>>> volfile,continuing
>>>>>>>> [2019-03-01 10:06:24.708082] C
>>>>>>>>
[rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-50: server
>>>>>>>> 192.168.0.56:49156 has not responded in the
last 42 seconds,
>>>>>>>> disconnecting.
>>>>>>>>
>>>>>>>> I don?t know why it happens, s06 server seems
to be reachable.
>>>>>>>>
>>>>>>>> [athena_login2][/users/home/sysm02/]> telnet
192.168.0.56 49156
>>>>>>>> Trying 192.168.0.56...
>>>>>>>> Connected to 192.168.0.56.
>>>>>>>> Escape character is '^]'.
>>>>>>>> ^CConnection closed by foreign host.
>>>>>>>> [athena_login2][/users/home/sysm02/]> ping
192.168.0.56
>>>>>>>> PING 192.168.0.56 (192.168.0.56) 56(84) bytes
of data.
>>>>>>>> 64 bytes from 192.168.0.56: icmp_seq=1 ttl=64
time=0.116 ms
>>>>>>>> 64 bytes from 192.168.0.56: icmp_seq=2 ttl=64
time=0.101 ms
>>>>>>>>
>>>>>>>> --- 192.168.0.56 ping statistics ---
>>>>>>>> 2 packets transmitted, 2 received, 0% packet
loss, time 1528ms
>>>>>>>> rtt min/avg/max/mdev = 0.101/0.108/0.116/0.012
ms
>>>>>>>>
>>>>>>>> [athena_login2][/users/home/sysm02/]> telnet
192.168.0.56 49156
>>>>>>>> Trying 192.168.0.56...
>>>>>>>> Connected to 192.168.0.56.
>>>>>>>> Escape character is '^]'.
>>>>>>>>
>>>>>>>> Thank you for your help,
>>>>>>>> Mauro
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1 Mar 2019, at 10:29, Mauro Tridici
<mauro.tridici at cmcc.it>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> thank you for the explanation.
>>>>>>>> I just changed network.ping-timeout option to
default value
>>>>>>>> (network.ping-timeout=42).
>>>>>>>>
>>>>>>>> I will check the logs to see if the errors will
appear again.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Mauro
>>>>>>>>
>>>>>>>> On 1 Mar 2019, at 04:43, Milind Changire
<mchangir at redhat.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> network.ping-timeout should not be set to zero
for non-glusterd
>>>>>>>> clients.
>>>>>>>> glusterd is a special case for which
ping-timeout is set to zero
>>>>>>>> via /etc/glusterfs/glusterd.vol
>>>>>>>>
>>>>>>>> Setting network.ping-timeout to zero disables
arming of the ping
>>>>>>>> timer for connections. This disables testing
the connection for
>>>>>>>> responsiveness and hence avoids proactive
fail-over.
>>>>>>>>
>>>>>>>> Please reset network.ping-timeout to a non-zero
positive value, eg.
>>>>>>>> 42
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Feb 28, 2019 at 5:07 PM Nithya
Balachandran <
>>>>>>>> nbalacha at redhat.com> wrote:
>>>>>>>>
>>>>>>>>> Adding Raghavendra and Milind to comment on
this.
>>>>>>>>>
>>>>>>>>> What is the effect of setting
network.ping-timeout to 0 and should
>>>>>>>>> it be set back to 42?
>>>>>>>>> Regards,
>>>>>>>>> Nithya
>>>>>>>>>
>>>>>>>>> On Thu, 28 Feb 2019 at 16:01, Mauro Tridici
<mauro.tridici at cmcc.it>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Nithya,
>>>>>>>>>>
>>>>>>>>>> sorry for the late.
>>>>>>>>>> network.ping-timeout has been set to 0
in order to try to solve
>>>>>>>>>> some timeout problems, but it didn?t
help.
>>>>>>>>>> I can set it to the default value.
>>>>>>>>>>
>>>>>>>>>> Can I proceed with the change?
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> Mauro
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 28 Feb 2019, at 04:41, Nithya
Balachandran <
>>>>>>>>>> nbalacha at redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Mauro,
>>>>>>>>>>
>>>>>>>>>> Is network.ping-timeout still set to 0.
The default value is 42.
>>>>>>>>>> Is there a particular reason why this
was changed?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Nithya
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, 27 Feb 2019 at 21:32, Mauro
Tridici <
>>>>>>>>>> mauro.tridici at cmcc.it> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Xavi,
>>>>>>>>>>>
>>>>>>>>>>> thank you for the detailed
explanation and suggestions.
>>>>>>>>>>> Yes, transport.listen-backlog
option is still set to 1024.
>>>>>>>>>>>
>>>>>>>>>>> I will check the network and
connectivity status using ?ping?
>>>>>>>>>>> and ?telnet? as soon as the errors
will come back again.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Mauro
>>>>>>>>>>>
>>>>>>>>>>> Il giorno 27 feb 2019, alle ore
16:42, Xavi Hernandez <
>>>>>>>>>>> jahernan at redhat.com> ha
scritto:
>>>>>>>>>>>
>>>>>>>>>>> Hi Mauro,
>>>>>>>>>>>
>>>>>>>>>>> those errors say that the mount
point is not connected to some
>>>>>>>>>>> of the bricks while executing
operations. I see references to 3rd and 6th
>>>>>>>>>>> bricks of several disperse sets,
which seem to map to server s06. For some
>>>>>>>>>>> reason, gluster is having troubles
connecting from the client machine to
>>>>>>>>>>> that particular server. At the end
of the log I see that after long time a
>>>>>>>>>>> reconnect is done to both of them.
However little after, other bricks from
>>>>>>>>>>> the s05 get disconnected and a
reconnect times out.
>>>>>>>>>>>
>>>>>>>>>>> That's really odd. It seems
like if server/communication is cut
>>>>>>>>>>> to s06 for some time, then
restored, and then the same happens to the next
>>>>>>>>>>> server.
>>>>>>>>>>>
>>>>>>>>>>> If the servers are really online
and it's only a communication
>>>>>>>>>>> issue, it explains why server
memory and network has increased: if the
>>>>>>>>>>> problem only exists between the
client and servers, any write made by the
>>>>>>>>>>> client will automatically mark the
file as damaged, since some of the
>>>>>>>>>>> servers have not been updated.
Since self-heal runs from the server nodes,
>>>>>>>>>>> they will probably be correctly
connected to all bricks, which allows them
>>>>>>>>>>> to heal the just damaged file,
which increases memory and network usage.
>>>>>>>>>>>
>>>>>>>>>>> I guess you still have
transport.listen-backlog set to 1024,
>>>>>>>>>>> right ?
>>>>>>>>>>>
>>>>>>>>>>> Just to try to identify if the
problem really comes from
>>>>>>>>>>> network, can you check if you lose
some pings from the client to all of the
>>>>>>>>>>> servers while you are seeing those
errors in the log file ?
>>>>>>>>>>>
>>>>>>>>>>> You can also check if during those
errors, you can telnet to the
>>>>>>>>>>> port of the brick from the client.
>>>>>>>>>>>
>>>>>>>>>>> Xavi
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 26, 2019 at 10:17 AM
Mauro Tridici <
>>>>>>>>>>> mauro.tridici at cmcc.it> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Nithya,
>>>>>>>>>>>>
>>>>>>>>>>>> ?df -h? operation is not still
slow, but no users are using the
>>>>>>>>>>>> volume, RAM and NETWORK usage
is ok on the client node.
>>>>>>>>>>>>
>>>>>>>>>>>> I was worried about this kind
of warnings/errors:
>>>>>>>>>>>>
>>>>>>>>>>>> [2019-02-25 10:59:00.664323] W
[MSGID: 122035]
>>>>>>>>>>>>
[ec-common.c:571:ec_child_select] 0-tier2-disperse-6: Executing operation
>>>>>>>>>>>> with some subvolumes
unavailable (20)
>>>>>>>>>>>>
>>>>>>>>>>>> [2019-02-26 03:11:35.212603] E
>>>>>>>>>>>>
[rpc-clnt.c:350:saved_frames_unwind] (-->
>>>>>>>>>>>>
/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (-->
>>>>>>>>>>>>
/usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (-->
>>>>>>>>>>>>
/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (-->
>>>>>>>>>>>>
/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45]
>>>>>>>>>>>> (-->
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] )))))
>>>>>>>>>>>> 0-tier2-client-50: forced
unwinding frame type(GlusterFS 3.3) op(WRITE(13))
>>>>>>>>>>>> called at 2019-02-26
03:10:56.549903 (xid=0x106f1c5)
>>>>>>>>>>>>
>>>>>>>>>>>> [2019-02-26 03:13:03.313831] E
>>>>>>>>>>>>
[socket.c:2376:socket_connect_finish] 0-tier2-client-50: connection to
>>>>>>>>>>>> 192.168.0.56:49156 failed
(Timeout della connessione);
>>>>>>>>>>>> disconnecting socket
>>>>>>>>>>>>
>>>>>>>>>>>> It seems that some subvolumes
are not available and
>>>>>>>>>>>> 192.168.0.56 server (s06) is
not reachable.
>>>>>>>>>>>> But gluster servers are
up&running and bricks are ok.
>>>>>>>>>>>>
>>>>>>>>>>>> In attachment the updated
tier2.log file.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Mauro
>>>>>>>>>>>>
>>>>>>>>>>>> Il giorno 26 feb 2019, alle ore
04:03, Nithya Balachandran <
>>>>>>>>>>>> nbalacha at redhat.com> ha
scritto:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I see a lot of EC messages in
the log but they don't seem very
>>>>>>>>>>>> serious. Xavi, can you take a
look?
>>>>>>>>>>>>
>>>>>>>>>>>> The only errors I see are:
>>>>>>>>>>>> [2019-02-25 10:58:45.519871] E
>>>>>>>>>>>>
[rpc-clnt.c:350:saved_frames_unwind] (-->
>>>>>>>>>>>>
/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (-->
>>>>>>>>>>>>
/usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (-->
>>>>>>>>>>>>
/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (-->
>>>>>>>>>>>>
/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45]
>>>>>>>>>>>> (-->
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] )))))
>>>>>>>>>>>> 0-tier2-client-50: forced
unwinding frame type(GlusterFS 3.3) op(WRITE(13))
>>>>>>>>>>>> called at 2019-02-25
10:57:47.429969 (xid=0xd26fe7)
>>>>>>>>>>>> [2019-02-25 10:58:51.461493] E
>>>>>>>>>>>>
[rpc-clnt.c:350:saved_frames_unwind] (-->
>>>>>>>>>>>>
/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (-->
>>>>>>>>>>>>
/usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (-->
>>>>>>>>>>>>
/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (-->
>>>>>>>>>>>>
/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45]
>>>>>>>>>>>> (-->
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] )))))
>>>>>>>>>>>> 0-tier2-client-41: forced
unwinding frame type(GlusterFS 3.3) op(WRITE(13))
>>>>>>>>>>>> called at 2019-02-25
10:57:47.499174 (xid=0xf47d6a)
>>>>>>>>>>>> [2019-02-25 11:07:57.152874] E
>>>>>>>>>>>>
[socket.c:2376:socket_connect_finish] 0-tier2-client-70: connection to
>>>>>>>>>>>> 192.168.0.55:49163 failed
(Timeout della connessione);
>>>>>>>>>>>> disconnecting socket
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Is the df -h operation still
slow? If yes, can you take a
>>>>>>>>>>>> tcpdump of the client while
running df -h and send that across?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Nithya
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, 25 Feb 2019 at 17:27,
Mauro Tridici <
>>>>>>>>>>>> mauro.tridici at cmcc.it>
wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry, some minutes after
my last mail message, I noticed that
>>>>>>>>>>>>> ?df -h? command hanged for
a while before returns the prompt.
>>>>>>>>>>>>> Yesterday, everything was
ok in the gluster client log, but,
>>>>>>>>>>>>> today, I see a lot of
errors (please, take a look to the attached file).
>>>>>>>>>>>>>
>>>>>>>>>>>>> On the client node, I
detected an important RAM e NETWORK
>>>>>>>>>>>>> usage.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you think that the
errors have been caused by the client
>>>>>>>>>>>>> resources usage?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>>> Mauro
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>
>>>>>
>>>> -------------------------
>>>> Mauro Tridici
>>>>
>>>> Fondazione CMCC
>>>> CMCC Supercomputing Center
>>>> presso Complesso Ecotekne - Universit? del Salento -
>>>> Strada Prov.le Lecce - Monteroni sn
>>>> 73100 Lecce  IT
>>>> http://www.cmcc.it
>>>>
>>>> mobile: (+39) 327 5630841
>>>> email: mauro.tridici at cmcc.it
>>>> https://it.linkedin.com/in/mauro-tridici-5977238b
>>>>
>>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>> +A mailing list of Red Hat associates involved in development, testingand production of RHGS <rhgs-devel at redhat.com>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.gluster.org/pipermail/gluster-users/attachments/20191201/ebf526b6/attachment.html>
Gluster users - Dec 2019 - [Gluster-devel] "rpc_clnt_ping_timer_expired" errors

[Gluster-users] [Gluster-devel] "rpc_clnt_ping_timer_expired" errors