Isaac Stone
2021-Oct-14 21:20 UTC
[Samba] db_ctdb_fetch_locked: folder lock contention of over 4 million ms (over an hour)
Hi, same problem happened again this morning, I have some more context from
the logs.

There was an election triggered by a dead node.

In our configuration only nodes 0 and 2 have `lmaster capability=true` and
`recmaster capability=true` in /etc/ctdb/ctdb.conf. On other nodes these are
set to false.

Looking at this:

Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Tearing down connection to dead node :2
Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: 10.1.3.19:4379: node 10.5.21.119:4379 is dead: 3 connected
Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdb-recoverd[32407]: Recmaster node 2 is disconnected/deleted. Force election
Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Recovery mode set to ACTIVE
Oct 14 12:52:27 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: This node (4) is now the recovery master
Oct 14 12:52:27 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Remote node (3) is now the recovery master
Oct 14 12:52:32 samba-server-s02-usw2-use1-follower-01 ctdb-recoverd[32407]: Election period ended, master=0

Seems there was a network interruption in connection to node 2, which was
the recmaster/lmaster.

Does it mean nodes 3 and 4 were both elected? Or does `master=0` mean 0 is
master? Node 0 _should_ be the master.

This seems to be the start of the problem:

Oct 14 12:52:38 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Recovery has started
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: smbXsrv_open_global.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: leases.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: locking.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: brlock.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: smbXsrv_tcon_global.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: smbXsrv_session_global.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: smbXsrv_client_global.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: smbXsrv_version_global.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: netlogon_creds_cli.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: g_lock.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: printer_list.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: share_info.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: secrets.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: account_policy.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: registry.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: group_mapping.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: passdb.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: winbindd_idmap.tdb
Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: ctdb.tdb
Oct 14 12:52:41 samba-server-s02-usw2-use1-follower-01 process-agent[575510]: 2021-10-14 12:52:41 UTC | PROCESS | INFO | (collector.go:160 in runCheck) | Finish container check #846640 in 19.823?s
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: share_info.tdb generation 1805205230
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db share_info.tdb
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: secrets.tdb generation 1805205230
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db secrets.tdb
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: account_policy.tdb generation 1805205230
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db account_policy.tdb
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: passdb.tdb generation 1805205230
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db passdb.tdb
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: winbindd_idmap.tdb generation 1805205230
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db winbindd_idmap.tdb
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: group_mapping.tdb generation 1805205230
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db group_mapping.tdb
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: registry.tdb generation 1805205230
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db registry.tdb
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: ctdb.tdb generation 1805205230
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db ctdb.tdb
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: smbXsrv_client_global.tdb generation 1805205230
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db smbXsrv_client_global.tdb
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: smbXsrv_tcon_global.tdb generation 1805205230
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db smbXsrv_tcon_global.tdb
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: smbXsrv_version_global.tdb generation 1805205230
Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db smbXsrv_version_global.tdb
Oct 14 12:52:51 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Clearing the tracking database for dbid 0x2d608c16
Oct 14 12:52:53 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: 10.1.3.19:4379: connected to 10.5.21.119:4379 - 4 connected

Reconnected to the former master here.

Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: netlogon_creds_cli.tdb generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db netlogon_creds_cli.tdb
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: g_lock.tdb generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db g_lock.tdb
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: brlock.tdb generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db brlock.tdb
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: smbXsrv_open_global.tdb generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db smbXsrv_open_global.tdb
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db smbXsrv_open_global.tdb reqid 232936286 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db smbXsrv_open_global.tdb reqid 232936287 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db smbXsrv_open_global.tdb reqid 232936288 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db smbXsrv_open_global.tdb reqid 232936289 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: locking.tdb generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db locking.tdb
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936290 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936291 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936292 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936293 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936294 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936295 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936296 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936297 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936298 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936299 generation 1805205230
... 290 total reqids resent

Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936572 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936573 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936574 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936575 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936576 generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: printer_list.tdb generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db printer_list.tdb
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: smbXsrv_session_global.tdb generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db smbXsrv_session_global.tdb
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[3530045]: [2021/10/14 12:52:56.088975, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[188298]: [2021/10/14 12:52:56.089017, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[2065835]: [2021/10/14 12:52:56.088993, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[173413]: [2021/10/14 12:52:56.089028, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[3530045]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/smbXsrv_open_global.tdb.4 key E9E66C7E, chain 48608 needed 1 attempts, 25787 milliseconds, chainlock: 0.034000 ms, CTDB 25787.146000 ms
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[173413]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/smbXsrv_open_global.tdb.4 key 45E18236, chain 61197 needed 1 attempts, 29600 milliseconds, chainlock: 0.018000 ms, CTDB 29600.219000 ms
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[2065835]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/smbXsrv_open_global.tdb.4 key C8D42267, chain 20590 needed 1 attempts, 26734 milliseconds, chainlock: 0.029000 ms, CTDB 26734.891000 ms
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[188298]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/smbXsrv_open_global.tdb.4 key 06EED6D3, chain 11119 needed 1 attempts, 27704 milliseconds, chainlock: 0.026000 ms, CTDB 27704.504000 ms
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: leases.tdb generation 1805205230
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db leases.tdb
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1475003]: [2021/10/14 12:52:56.155650, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1475003]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.4 key 7A70496E4466F9AD0000CE3D080000000000000000000000, chain 28289 needed 1 attempts, 5792 milliseconds, chainlock: 5645.945000 ms, CTDB 146.268000 ms
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1482694]: [2021/10/14 12:52:56.155717, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1482694]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/smbXsrv_session_global.tdb.4 key 861E88DE, chain 99225 needed 1 attempts, 5454 milliseconds, chainlock: 5309.574000 ms, CTDB 144.484000 ms
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1482030]: [2021/10/14 12:52:56.235506, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1482030]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.4 key 7A70496E4466F9AD0F3DF538430000000000000000000000, chain 57341 needed 1 attempts, 103316 milliseconds, chainlock: 0.064000 ms, CTDB 103316.804000 ms
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1475823]: [2021/10/14 12:52:56.244963, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1475823]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.4 key 7A70496E4466F9ADEF084B9B2C0000000000000000000000, chain 17348 needed 1 attempts, 849678 milliseconds, chainlock: 0.024000 ms, CTDB 849678.676000 ms

etc etc until we restart the whole cluster

(high hopcount logs happening at the same time, though not as high as before)

On Mon, Oct 11, 2021 at 4:18 AM Martin Schwenke <martin at meltin.net> wrote:

> On Fri, 8 Oct 2021 10:27:09 -0700, Isaac Stone <isaac.stone at som.com>
> wrote:
>
> > Had the same problem again yesterday, only different.
> >
> > - same problem in that there was a single folder no one could access.
> >
> > - different in that the errors logged were different
> >
> > Instead of db_ctdb_fetch_locked errors in the smbd logs, we only had high
> > hopcount errors in the ctdbd logs, and only on two of the nodes
> >
> > hopcount was also incredibly high and numerous, logging upwards of 60 times
> > a second with hopcounts in excess of 500 thousand.
> >
> > The errors were all for the exact same key.
> >
> > On node 2:
> >
> > High hopcount 430199 dbid:locking.tdb key:0x5930e2c3 reqid=003ef9ef pnn:3
> > src:0 lmaster:4 header->dmaster:2 dst:4
> >
> > On node 3:
> >
> > High hopcount 350598 dbid:locking.tdb key:0x5930e2c3 reqid=0e34f704 pnn:4
> > src:4 lmaster:4 header->dmaster:3 dst:3
> >
> > We fixed the error same as last time - bring down all nodes, wipe the
> > /volatile/ directory, and bring the cluster back up.
>
> That's bad. That indicates that something is getting corrupted
> somewhere. Is there any indication that you're running out of disk
> space or memory?
>
> Quite a few years ago there were some races in vacuuming (garbage
> collection) where this could happen but they were fixed. Last time we
> saw this sort of thing we audited the code and couldn't find a
> (relevant) bug. For the situation we were looking at I'm quite sure
> that we tracked it down to running out of something (disk? memory?)
> and being unable to recover. If this is happening repeatedly there
> should be a common clue in the logs, around the time it starts.
>
> > Questions:
> >
> > Is there a way to correlate the key in the "high hopcount" logs to a
> > specific file lock? Something like `net tdb locking`?
>
> The key in the high hopcount logs is a hash of the key, so it is one
> way. We really should print the hex key. If you can guess the key
> then you can generate the hash (using
> lib/tdb/common/hash.c:tdb_jenkins_hash()) and confirm.
>
> > Is there a way to diagnose and fix these problems without bringing the
> > whole cluster down and wiping all locks everywhere? It is effective but
> > also causes systemwide service interruptions.
>
> With slightly improved logging that might be possible. However, there
> is a catch-22. If you shut down all of the nodes then you disconnect
> the clients and you don't risk filesystem data corruption. If you
> figure out the key, and figure out how to remove it from the databases
> on all nodes (undocumented "ctdb tstore" with empty data *might* do what
> you want - I'd have to read code to confirm, but need to sleep instead)
> then you'll be alive but you risk filesystem data corruption. The
> simplified reasoning here is that you may have a client that thinks it
> has a lock on a file or directory (and is modifying it), you'll
> effectively release the lock by deleting the record, and then other
> clients will also get access (and also modify it).
>
> > I noticed the `setdbsticky` option in the man page. It says it is
> > experimental, but also indicates it may solve this problem. Is there any
> > more information on this option? I am hesitant to recommend experimental
> > settings for prod.
>
> Hop counts that high indicate some sort of corruption. It seems very
> unlikely that this is just a performance problem.
>
> > I am leaning towards changing fileid:algorithm to fsname_nodirs as a
> > solution to this recurring problem, but I want to understand the problem
> > better before making that change. Any advice is greatly appreciated.
>
> I'm trying to get an overview of CTDB performance issues here:
>
>   https://wiki.samba.org/index.php/CTDB_Performance
>
> Nobody has fully explained the implications of that option there. You
> have to understand that it breaks lock coherency (i.e. effectively no
> locking on directories) and the implications of that.
>
> Sorry I can't be more help right now... busy day, need sleep...
>
> peace & happiness,
> martin
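
(For anyone who wants to try the hash confirmation Martin describes above, a
minimal, untested sketch follows. It assumes that the tdb_jenkins_hash()
exported in libtdb's public tdb.h is the same routine as
lib/tdb/common/hash.c:tdb_jenkins_hash() and that the key:0x... value in the
High hopcount messages is that 32-bit hash; the file name and usage here are
made up for illustration.)

    /* hash_check.c - sketch: hash a guessed record key and compare it with
     * the key:0x... value CTDB prints in "High hopcount" messages.
     * Build (assuming libtdb development files are installed):
     *   cc -o hash_check hash_check.c $(pkg-config --cflags --libs tdb)
     * Usage: ./hash_check <key-as-hex>
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <tdb.h>

    int main(int argc, char **argv)
    {
        if (argc != 2 || strlen(argv[1]) % 2 != 0) {
            fprintf(stderr, "usage: %s <key-as-hex>\n", argv[0]);
            return 1;
        }

        size_t len = strlen(argv[1]) / 2;
        unsigned char *buf = malloc(len);
        if (buf == NULL) {
            return 1;
        }

        /* decode the hex string into raw key bytes */
        for (size_t i = 0; i < len; i++) {
            unsigned int byte;
            if (sscanf(argv[1] + 2 * i, "%2x", &byte) != 1) {
                fprintf(stderr, "bad hex input\n");
                return 1;
            }
            buf[i] = (unsigned char)byte;
        }

        TDB_DATA key = { .dptr = buf, .dsize = len };
        printf("tdb_jenkins_hash = 0x%08x\n", tdb_jenkins_hash(&key));
        free(buf);
        return 0;
    }

Feed it a guessed key - for example a 48-hex-digit locking.tdb key from a
db_ctdb_fetch_locked message - and compare the output with the key:0x5930e2c3
value in the High hopcount entries.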
Martin Schwenke
2021-Oct-15 06:48 UTC
[Samba] db_ctdb_fetch_locked: folder lock contention of over 4 million ms (over an hour)
Hi Isaac,

Comments inline below...

On Thu, 14 Oct 2021 14:20:38 -0700, Isaac Stone <isaac.stone at som.com> wrote:

> Hi, same problem happened again this morning, I have some more context from
> the logs
>
> There was an election triggered by a dead node
>
> In our configuration only nodes 0 and 2 have `lmaster capability=true` and
> `recmaster capability=true` in /etc/ctdb/ctdb.conf.
> On other nodes these are set to false.
>
> Looking at this
>
> Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Tearing down connection to dead node :2
> Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: 10.1.3.19:4379: node 10.5.21.119:4379 is dead: 3 connected
> Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdb-recoverd[32407]: Recmaster node 2 is disconnected/deleted. Force election
> Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Recovery mode set to ACTIVE
> Oct 14 12:52:27 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: This node (4) is now the recovery master
> Oct 14 12:52:27 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Remote node (3) is now the recovery master
> Oct 14 12:52:32 samba-server-s02-usw2-use1-follower-01 ctdb-recoverd[32407]: Election period ended, master=0
>
> Seems there was a network interruption in connection to node 2, which was
> the recmaster/lmaster
>
> Does it mean nodes 3 and 4 were both elected? Or does `master=0` mean 0 is
> master?
>
> node 0 _should_ be the master.

No, it means that nodes 4 and then 3 won rounds of the election. If you look
in the log on node 3 you'll probably find that it noticed the dead node at
about the same time, so also initiated an election. Both nodes will have let
the other win. I now have a patch to demote those election progress messages
to INFO level. The final message indicates who won (i.e. 0) - the election
ends via a timeout.

So, this is a bug (and those messages were useful). A node that can't win an
election shouldn't start an election with a standard vote for itself. A node
that can't win an election should probably go into recovery, set the master
to the unknown value and then wait for another node to win an election. The
biggest problem is if no node can be master then the cluster sits in recovery
until a potential master reappears. We could treat the recmaster capability
as a preference and allow 2nd tier nodes to be recmaster but that is probably
confusing... and much harder to implement.

I won't try to fix this now because that would touch code that I'm modifying
for another change. These other changes should also avoid the above confusion
if you're using the recovery lock. I'm planning to convert the recovery lock
to a cluster lock. If this is in use then the first node to take the lock is
leader/recovery-master and no other form of election is run. This avoids a
class of problems where nodes elect themselves before they are connected to
other nodes and then are unable to take the lock, so may get banned... or
where there are other election bugs... ;-)

After I've done that, I'll trivially fix this problem.
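
(For illustration, a second-tier node in the setup described here - one that
must never become recmaster/lmaster - together with the recovery lock Martin
refers to, would look roughly like this in /etc/ctdb/ctdb.conf. The
[legacy]/[cluster] section placement is from my reading of ctdb.conf(5) and
the lock path is a placeholder, so treat this as a sketch rather than the
poster's actual config:)

    [legacy]
        # second-tier node: may serve data but must not win elections
        # or be lmaster for records
        recmaster capability = false
        lmaster capability = false

    [cluster]
        # recovery lock on storage visible to every node;
        # path below is only a placeholder
        recovery lock = /clusterfs/.ctdb/reclock

The recovery lock has to live on shared storage that all nodes can see; the
change Martin describes would make whichever node takes that lock first the
leader, with no separate election.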
Note that lmaster is a per-record concept:

  https://wiki.samba.org/index.php/CTDB_database_design#Distributed_database_design

The capability indicates a node's ability to be lmaster for records.

> This seems to be the start of the problem
>
> Oct 14 12:52:38 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Recovery has started
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: smbXsrv_open_global.tdb
> [...]
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936575 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936576 generation 1805205230

I also now have a patch to log per-database summaries of resent calls at
NOTICE level and demote the per-call messages to INFO level. ;-)
> [...]
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1482030]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.4 key 7A70496E4466F9AD0F3DF538430000000000000000000000, chain 57341 needed 1 attempts, 103316 milliseconds, chainlock: 0.064000 ms, CTDB 103316.804000 ms
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1475823]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.4 key 7A70496E4466F9ADEF084B9B2C0000000000000000000000, chain 17348 needed 1 attempts, 849678 milliseconds, chainlock: 0.024000 ms, CTDB 849678.676000 ms
>
> etc etc until we restart the whole cluster
>
> (high hopcount logs happening at the same time, though not as high as before)

So, recovery started just after 12:52:32 and the node went away ~25 seconds
before 12:52:26, so ~12:52:01. Note that quite a few of the warnings are
explained by either the node going away or the recovery - anything less than
~55 seconds. There are some outliers here at ~103s and ~850s, so I don't
think this is where the problem started. I think this is where it ended!

My gut tells me to go back ~850s and see what was happening on node 2. The
problem was occurring then on some node and the recovery seems to have solved
it with node 2 excluded.

Still no explanation for the high hop counts. If they are fairly small (100s,
not 1000s) then they may just indicate contention for records, and given your
geographically distributed nodes, this may be a problem that you can't work
around without read-only or sticky records. If you continue to see huge hop
counts then you need to understand the source of the corruption.

In future, please consider attaching the log file and just quoting the
relevant entries. Lines seem to have wrapped at ~80 characters, which made
this quite difficult to browse... :-)

Thanks...

peace & happiness,
martin
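
(A possible way to follow up on the locking.tdb outliers above, and on the
earlier question of mapping a record key back to a file: the 48-hex-digit keys
in the db_ctdb_fetch_locked messages look like Samba's struct file_id - devid,
inode and extid as three 64-bit values in host byte order - so a small decoder
like the sketch below can recover a device/inode pair. The struct layout and
little-endian assumption are mine, not something confirmed in this thread.)

    /* fileid_decode.c - sketch: decode a locking.tdb key such as
     * 7A70496E4466F9ADEF084B9B2C0000000000000000000000 into devid/inode/extid,
     * assuming it is a raw little-endian struct file_id (an assumption).
     * Build: cc -o fileid_decode fileid_decode.c
     * Usage: ./fileid_decode <48-hex-digit-key>
     */
    #include <inttypes.h>
    #include <stdio.h>
    #include <string.h>

    /* read 8 bytes as a little-endian 64-bit value */
    static uint64_t le64(const unsigned char *p)
    {
        uint64_t v = 0;
        for (int i = 7; i >= 0; i--) {
            v = (v << 8) | p[i];
        }
        return v;
    }

    int main(int argc, char **argv)
    {
        unsigned char buf[24];

        if (argc != 2 || strlen(argv[1]) != 48) {
            fprintf(stderr, "usage: %s <48-hex-digit locking.tdb key>\n", argv[0]);
            return 1;
        }
        for (int i = 0; i < 24; i++) {
            unsigned int byte;
            if (sscanf(argv[1] + 2 * i, "%2x", &byte) != 1) {
                fprintf(stderr, "bad hex input\n");
                return 1;
            }
            buf[i] = (unsigned char)byte;
        }

        printf("devid = 0x%" PRIx64 "\n", le64(buf));
        printf("inode = %" PRIu64 "\n", le64(buf + 8));
        printf("extid = %" PRIu64 "\n", le64(buf + 16));
        return 0;
    }

If the devid corresponds to the clustered filesystem, something like
`find /path/to/mount -xdev -inum <inode>` should then point at the contended
file or directory.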