thr3ads.net - samba - [Samba] [ctdb]Unable to run startrecovery event [Sep 2018]


zhu.shangzhong at zte.com.cn

2018-Sep-06 06:24 UTC

[Samba] [ctdb]Unable to run startrecovery event

Martin,
I have checked more logs.
Before ctdb-eventd went away, system memory utilization was very high, almost
100%.
Is that related to "Bad talloc magic value - wrong talloc version
used/mixed"?

2018/08/14 15:22:57.818762 ctdb-eventd[10131]: 05.system: WARNING: System memory utilization 95% >= threshold 80%
2018/08/14 15:22:57.818800 ctdb-eventd[10131]: 05.system: WARNING: System swap utilization 28% >= threshold 25%
2018/08/14 15:24:16.584568 ctdb-eventd[10131]: 05.system: WARNING: System memory utilization 94% >= threshold 80%
2018/08/14 15:24:32.198828 ctdb-eventd[10131]: 05.system: WARNING: System memory utilization 93% >= threshold 80%
......
......
2018/09/03 12:24:23.585113 ctdb-eventd[10131]: 05.system: WARNING: System swap utilization 99% >= threshold 25%
2018/09/03 12:24:39.316335 ctdb-eventd[10131]: 05.system: WARNING: System memory utilization 98% >= threshold 80%
2018/09/03 12:24:55.153527 ctdb-eventd[10131]: 05.system: WARNING: System memory utilization 99% >= threshold 80%
2018/09/03 12:24:55.153583 ctdb-eventd[10131]: 05.system: WARNING: System swap utilization 100% >= threshold 25%
2018/09/03 12:25:10.894818 ctdb-eventd[10131]: 05.system: WARNING: System memory utilization 98% >= threshold 80%
2018/09/03 12:25:26.580002 ctdb-eventd[10131]: 05.system: WARNING: System swap utilization 99% >= threshold 25%
2018/09/03 12:25:42.285185 ctdb-eventd[10131]: 05.system: WARNING: System memory utilization 97% >= threshold 80%
2018/09/03 12:25:58.076746 ctdb-eventd[10131]: 05.system: WARNING: System memory utilization 96% >= threshold 80%
2018/09/03 12:25:58.822490 ctdb-eventd[10131]: run_proc failed for 99.timeout, ret=17
2018/09/03 12:25:58.822557 ctdb-eventd[10131]: COMMAND_RUN failed
2018/09/03 12:25:58.822569 ctdb-eventd[10131]: client read failed with ret=17
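
For anyone who wants to check their own log.ctdb for the same pattern, here is a minimal Python sketch that pulls the 05.system memory and swap warnings out of the log and reports the peak values. The regular expression is written against the exact message format shown in the log above and matches whether or not the line has been wrapped by the mailer. The function names (parse_warnings, peak_utilization) are made up for illustration and are not part of CTDB.

```python
import re

# Pattern based only on the 05.system warning lines quoted in this thread.
# \s+ between "memory"/"swap" and "utilization" tolerates a mail line wrap.
WARNING_RE = re.compile(
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) ctdb-eventd\[\d+\]: "
    r"05\.system: WARNING: System (?P<kind>memory|swap)\s+"
    r"utilization (?P<pct>\d+)% >= threshold (?P<thr>\d+)%"
)


def parse_warnings(log_text):
    """Return a list of (timestamp, kind, percent, threshold) tuples."""
    return [
        (m["ts"], m["kind"], int(m["pct"]), int(m["thr"]))
        for m in WARNING_RE.finditer(log_text)
    ]


def peak_utilization(log_text):
    """Return the highest reported percentage per kind ('memory', 'swap')."""
    peaks = {}
    for _ts, kind, pct, _thr in parse_warnings(log_text):
        peaks[kind] = max(pct, peaks.get(kind, 0))
    return peaks


# Three warnings copied from the log excerpt above.
sample = """\
2018/09/03 12:24:55.153527 ctdb-eventd[10131]: 05.system: WARNING: System memory
utilization 99% >= threshold 80%
2018/09/03 12:24:55.153583 ctdb-eventd[10131]: 05.system: WARNING: System swap
utilization 100% >= threshold 25%
2018/09/03 12:25:10.894818 ctdb-eventd[10131]: 05.system: WARNING: System memory
utilization 98% >= threshold 80%
"""

print(peak_utilization(sample))  # {'memory': 99, 'swap': 100}
```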

Thanks!

-------------------------------------------------------------
Thanks Martin!
We are using CTDB 4.6.10.

> Are you able to recreate this every time? Sometimes? Rarely?
Rarely.

> Note that you're referring to nodes 1, 2, 3 while CTDB numbers the
> nodes 0, 1, 2. In fact, the situation is a little more confused than
> this:
That was my mistake. CTDB numbers the nodes 0, 1, 2.

# ctdb status
Number of nodes:3
pnn:0 10.231.8.70    OK
pnn:1 10.231.8.68    OK
pnn:2 10.231.8.69    OK (THIS NODE)

# ctdb ip
Public IPs on node 2
10.231.8.68 1
10.231.8.69 2
10.231.8.70 0
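
To avoid mixing up node numbers when reading logs, the "ctdb status" output above can be turned into a PNN-to-address table. The following is only a sketch in Python: it parses the "pnn:" lines in the format shown above and nothing else, and parse_status is a hypothetical helper name, not a CTDB tool.

```python
import re

# Matches lines such as "pnn:2 10.231.8.69    OK (THIS NODE)".
# The format is taken from the output pasted above; other fields that
# "ctdb status" may print are simply ignored.
NODE_RE = re.compile(r"^pnn:(\d+)\s+(\S+)\s+(\S+)(\s+\(THIS NODE\))?\s*$")


def parse_status(text):
    """Map each PNN to its address, state string and a 'this node' flag."""
    nodes = {}
    for line in text.splitlines():
        m = NODE_RE.match(line.strip())
        if m:
            nodes[int(m.group(1))] = {
                "address": m.group(2),
                "state": m.group(3),
                "this_node": m.group(4) is not None,
            }
    return nodes


status_output = """\
Number of nodes:3
pnn:0 10.231.8.70    OK
pnn:1 10.231.8.68    OK
pnn:2 10.231.8.69    OK (THIS NODE)
"""

for pnn, info in sorted(parse_status(status_output).items()):
    marker = " <- this node" if info["this_node"] else ""
    print(f"node {pnn}: {info['address']} {info['state']}{marker}")
```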

-----------------------------------
Re: [Samba] [ctdb]Unable to run startrecovery event (if mail content is encrypted,
please see the attached file)
Thanks for reporting this.  It looks very interesting and we will fix
it all as soon as we understand it!  :-)

On Wed, 5 Sep 2018 16:29:31 +0800 (CST), "zhu.shangzhong--- via samba"
<samba at lists.samba.org> wrote:
> A 3-node CTDB cluster is running. When one of the 3 nodes is
> powered down, lots of logs are written to log.ctdb.
Can you please let us know what version of Samba/CTDB you're using?

Note that you're referring to nodes 1, 2, 3 while CTDB numbers the
nodes 0, 1, 2.  In fact, the situation is a little more confused than
this:
> Power down node3
> The node1 log is as follow:
> 2018/09/04 04:29:33.402108 ctdbd[10129]: 10.231.8.65:4379: node 10.231.8.67:4379 is dead: 1 connected
> 2018/09/04 04:29:33.414817 ctdbd[10129]: Tearing down connection to dead node :0
It appears that the node you're calling node 3 is the one CTDB calls
node 0!  Can you please post the output of "ctdb status" when all
nodes are up and running?

I'm guessing that your nodes file looks like:

10.231.8.67
10.231.8.65
10.231.8.66
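
The reasoning behind that guess is that CTDB assigns PNNs by position in the nodes file, counting from 0, so the first address listed becomes node 0. A minimal Python sketch of that mapping, applied to the guessed file above, follows. It assumes a file with one address per line and ignores blank lines; comment lines are not handled, and pnn_map is a made-up name for illustration.

```python
def pnn_map(nodes_file_text):
    """Map PNN -> address by line order, starting at 0.

    Assumption: one address per line, blank lines ignored, no comments.
    """
    addresses = [
        line.strip() for line in nodes_file_text.splitlines() if line.strip()
    ]
    return dict(enumerate(addresses))


# The nodes file guessed above.
guessed_nodes_file = """\
10.231.8.67
10.231.8.65
10.231.8.66
"""

for pnn, address in pnn_map(guessed_nodes_file).items():
    print(f"pnn:{pnn} {address}")
```

With this ordering 10.231.8.67 is PNN 0, which would match the log reporting 10.231.8.67:4379 as dead and then tearing down the connection to node 0.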

This:
> node1: repeat logs:
> 2018/09/04 04:35:06.414369 ctdbd[10129]: Recovery has started
> 2018/09/04 04:35:06.414944 ctdbd[10129]: connect() failed, errno=111
> 2018/09/04 04:35:06.415076 ctdbd[10129]: Unable to run startrecovery event
is due to this:
> 2018/09/04 04:29:55.570212 ctdb-eventd[10131]: Bad talloc magic value - wrong talloc version used/mixed
> 2018/09/04 04:29:57.240533 ctdbd[10129]: Eventd went away
We have fixed a similar issue in some versions.  When we know what
version you are running then we can say whether it is a known issue or
a new issue.

I have been working on the following issue for most of this week:
> 2018/09/04 04:29:52.465663 ctdbd[10129]: This node (1) is now the recovery master
> 2018/09/04 04:29:55.468771 ctdb-recoverd[11302]: Election period ended
> 2018/09/04 04:29:55.469404 ctdb-recoverd[11302]: Node 2 has changed flags - now 0x8  was 0x0
> 2018/09/04 04:29:55.469475 ctdb-recoverd[11302]: Remote node 2 had flags 0x8, local had 0x0 - updating local
> 2018/09/04 04:29:55.469514 ctdb-recoverd[11302]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
> 2018/09/04 04:29:55.469525 ctdb-recoverd[11302]: Attempting to take recovery lock (/share-fs/export/ctdb/.ctdb/reclock)
> 2018/09/04 04:29:55.563522 ctdb-recoverd[11302]: Unable to take recovery lock - contention
> 2018/09/04 04:29:55.563573 ctdb-recoverd[11302]: Unable to get recovery lock - aborting recovery and ban ourself for 300 seconds
> 2018/09/04 04:29:55.563585 ctdb-recoverd[11302]: Banning node 1 for 300 seconds
Are you able to recreate this every time?  Sometimes?  Rarely?

I hadn't seen this until recently and I'm now worried that it is more
widespread than we realise.

Thanks...

peace & happiness,
martin

Martin Schwenke

2018-Sep-06 08:58 UTC


[Samba] [ctdb]Unable to run startrecovery event

On Thu, 6 Sep 2018 14:24:12 +0800 (CST), "zhu.shangzhong--- via samba"
<samba at lists.samba.org> wrote:
> I have checked more logs.
> Before ctdb-eventd went away, system memory utilization is very high,
> almost 100%.
> Is it related to "Bad talloc magic value - wrong talloc version
> used/mixed"?
Definitely.  The only time we have seen this in eventd is when a memory
allocation failed.  The fix is commit
e4a5d610b8e81c78b7d98217bc87c4b815b4c4e7.  However, I don't think we
backported this because in the same situation we saw several other
out-of-memory symptoms that are more difficult to fix, so it seemed
pointless.

This fix did make it into 4.9, which should be released next week.

If you really would like to see a backport to other releases then
please open a bug at https://bugzilla.samba.org/ . However, please note
that 4.6 is now security fixes only and this isn't security related.
When 4.9 is released then 4.7 will also be security fixes only.

I'm still working on the recovery lock issue...

peace & happiness,
martin
