Martin Schwenke
2021-Oct-15 06:48 UTC
[Samba] db_ctdb_fetch_locked: folder lock contention of over 4 million ms (over an hour)
Hi Isaac,

Comments inline below...

On Thu, 14 Oct 2021 14:20:38 -0700, Isaac Stone <isaac.stone at som.com> wrote:

> Hi, same problem happened again this morning, I have some more context from the logs
>
> There was an election triggered by a dead node
>
> In our configuration only nodes 0 and 2 have `lmaster capability=true` and `recmaster capability=true` in /etc/ctdb/ctdb.conf.
> On other nodes these are set to false.
>
> Looking at this
>
> Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Tearing down connection to dead node :2
> Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: 10.1.3.19:4379: node 10.5.21.119:4379 is dead: 3 connected
> Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdb-recoverd[32407]: Recmaster node 2 is disconnected/deleted. Force election
> Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Recovery mode set to ACTIVE
> Oct 14 12:52:27 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: This node (4) is now the recovery master
> Oct 14 12:52:27 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Remote node (3) is now the recovery master
> Oct 14 12:52:32 samba-server-s02-usw2-use1-follower-01 ctdb-recoverd[32407]: Election period ended, master=0
>
> Seems there was a network interruption in connection to node 2, which was the recmaster/lmaster
>
> Does it mean nodes 3 and 4 were both elected? Or does `master=0` mean 0 is master?
>
> node 0 _should_ be the master.

No, it means that nodes 4 and then 3 won rounds of the election. If you look in the log on node 3 you'll probably find that it noticed the dead node at about the same time, so also initiated an election. Both nodes will have let the other win. I now have a patch to demote those election progress messages to INFO level.

The final message indicates who won (i.e. 0) - the election ends via a timeout.

So, this is a bug (and those messages were useful). A node that can't win an election shouldn't start an election with a standard vote for itself. A node that can't win an election should probably go into recovery, set the master to the unknown value and then wait for another node to win an election. The biggest problem is that if no node can be master then the cluster sits in recovery until a potential master reappears. We could treat the recmaster capability as a preference and allow 2nd-tier nodes to be recmaster, but that is probably confusing... and much harder to implement.

I won't try to fix this now because that would touch code that I'm modifying for another change. These other changes should also avoid the above confusion if you're using the recovery lock. I'm planning to convert the recovery lock to a cluster lock. If this is in use then the first node to take the lock is leader/recovery-master and no other form of election is run. This avoids a class of problems where nodes elect themselves before they are connected to other nodes and then are unable to take the lock, so may get banned... or where there are other election bugs... ;-)

After I've done that, I'll trivially fix this problem.
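For reference, both the capability settings mentioned above and the recovery lock live in /etc/ctdb/ctdb.conf. The following is only a minimal sketch, not a copy of anyone's config - the lock path is a placeholder, and the option names should be checked against ctdb.conf(5) for the installed version:

  [cluster]
  # A lock file on storage that every node can reach. With the planned
  # cluster-lock change, the first node to take this lock becomes the
  # leader, which sidesteps the election race discussed above.
  recovery lock = /path/to/shared/fs/.ctdb/reclock

  [legacy]
  # On the two "first tier" nodes (0 and 2) in the setup described above;
  # the other nodes would carry "false" for both.
  lmaster capability = true
  recmaster capability = true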
Note that lmaster is a per-record concept:

  https://wiki.samba.org/index.php/CTDB_database_design#Distributed_database_design

The capability indicates a node's ability to be lmaster for records.

> This seems to be the start of the problem
>
> Oct 14 12:52:38 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Recovery has started
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: smbXsrv_open_global.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: leases.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: locking.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: brlock.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: smbXsrv_tcon_global.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: smbXsrv_session_global.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: smbXsrv_client_global.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: smbXsrv_version_global.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: netlogon_creds_cli.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: g_lock.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: printer_list.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: share_info.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: secrets.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: account_policy.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: registry.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: group_mapping.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: passdb.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: winbindd_idmap.tdb
> Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Freeze db: ctdb.tdb
> Oct 14 12:52:41 samba-server-s02-usw2-use1-follower-01 process-agent[575510]: 2021-10-14 12:52:41 UTC | PROCESS | INFO | (collector.go:160 in runCheck) | Finish container check #846640 in 19.823?s
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: share_info.tdb generation 1805205230
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db share_info.tdb
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: secrets.tdb generation 1805205230
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db secrets.tdb
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: account_policy.tdb generation 1805205230
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db account_policy.tdb
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: passdb.tdb generation 1805205230
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db passdb.tdb
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: winbindd_idmap.tdb generation 1805205230
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db winbindd_idmap.tdb
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: group_mapping.tdb generation 1805205230
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db group_mapping.tdb
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: registry.tdb generation 1805205230
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db registry.tdb
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: ctdb.tdb generation 1805205230
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db ctdb.tdb
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: smbXsrv_client_global.tdb generation 1805205230
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db smbXsrv_client_global.tdb
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: smbXsrv_tcon_global.tdb generation 1805205230
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db smbXsrv_tcon_global.tdb
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: smbXsrv_version_global.tdb generation 1805205230
> Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db smbXsrv_version_global.tdb
> Oct 14 12:52:51 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Clearing the tracking database for dbid 0x2d608c16
> Oct 14 12:52:53 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: 10.1.3.19:4379: connected to 10.5.21.119:4379 - 4 connected
>
> Reconnected to the former master here
>
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: netlogon_creds_cli.tdb generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db netlogon_creds_cli.tdb
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: g_lock.tdb generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db g_lock.tdb
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: brlock.tdb generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db brlock.tdb
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: smbXsrv_open_global.tdb generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db smbXsrv_open_global.tdb
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db smbXsrv_open_global.tdb reqid 232936286 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db smbXsrv_open_global.tdb reqid 232936287 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db smbXsrv_open_global.tdb reqid 232936288 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db smbXsrv_open_global.tdb reqid 232936289 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: locking.tdb generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db locking.tdb
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936290 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936291 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936292 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936293 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936294 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936295 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936296 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936297 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936298 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936299 generation 1805205230
>
> ... 290 total reqids resent
>
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936572 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936573 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936574 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936575 generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: resent ctdb_call for db locking.tdb reqid 232936576 generation 1805205230

I also now have a patch to log per-database summaries of resent calls at NOTICE level and demote the per-call messages to INFO level.  ;-)

> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: printer_list.tdb generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db printer_list.tdb
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: smbXsrv_session_global.tdb generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db smbXsrv_session_global.tdb
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[3530045]: [2021/10/14 12:52:56.088975, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[188298]: [2021/10/14 12:52:56.089017, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[2065835]: [2021/10/14 12:52:56.088993, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[173413]: [2021/10/14 12:52:56.089028, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[3530045]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/smbXsrv_open_global.tdb.4 key E9E66C7E, chain 48608 needed 1 attempts, 25787 milliseconds, chainlock: 0.034000 ms, CTDB 25787.146000 ms
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[173413]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/smbXsrv_open_global.tdb.4 key 45E18236, chain 61197 needed 1 attempts, 29600 milliseconds, chainlock: 0.018000 ms, CTDB 29600.219000 ms
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[2065835]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/smbXsrv_open_global.tdb.4 key C8D42267, chain 20590 needed 1 attempts, 26734 milliseconds, chainlock: 0.029000 ms, CTDB 26734.891000 ms
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[188298]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/smbXsrv_open_global.tdb.4 key 06EED6D3, chain 11119 needed 1 attempts, 27704 milliseconds, chainlock: 0.026000 ms, CTDB 27704.504000 ms
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw db: leases.tdb generation 1805205230
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Release freeze handle for db leases.tdb
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1475003]: [2021/10/14 12:52:56.155650, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1475003]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.4 key 7A70496E4466F9AD0000CE3D080000000000000000000000, chain 28289 needed 1 attempts, 5792 milliseconds, chainlock: 5645.945000 ms, CTDB 146.268000 ms
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1482694]: [2021/10/14 12:52:56.155717, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1482694]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/smbXsrv_session_global.tdb.4 key 861E88DE, chain 99225 needed 1 attempts, 5454 milliseconds, chainlock: 5309.574000 ms, CTDB 144.484000 ms
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1482030]: [2021/10/14 12:52:56.235506, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1482030]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.4 key 7A70496E4466F9AD0F3DF538430000000000000000000000, chain 57341 needed 1 attempts, 103316 milliseconds, chainlock: 0.064000 ms, CTDB 103316.804000 ms
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1475823]: [2021/10/14 12:52:56.244963, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1475823]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.4 key 7A70496E4466F9ADEF084B9B2C0000000000000000000000, chain 17348 needed 1 attempts, 849678 milliseconds, chainlock: 0.024000 ms, CTDB 849678.676000 msd 232936329 generation 1805205230
>
> etc etc until we restart the whole cluster
>
> (high hopcount logs happening at the same time, though not as high as before)

So, recovery started just after 12:52:32 and the node went away ~25 seconds before 12:52:26, so ~12:52:01. Note that quite a few of the warnings are explained by either the node going away or the recovery - anything less than ~55 seconds.

There are some outliers here at ~103s and ~850s, so I don't think this is where the problem started. I think this is where it ended! My gut tells me to go back ~850s and see what was happening on node 2. The problem was occurring then on some node and the recovery seems to have solved it with node 2 excluded.

Still no explanation for the high hop counts. If they are fairly small (100s, not 1000s) then they may just indicate contention for records, and given your geographically distributed nodes, this may be a problem that you can't work around without read-only or sticky records.

If you continue to see huge hop counts then you need to understand the source of the corruption.

In future, please consider attaching the log file and just quoting the relevant entries. Lines seem to have wrapped at ~80 characters, which made this quite difficult to browse... :-)

Thanks...

peace & happiness,
martin
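As a footnote to the read-only/sticky record suggestion above: both can be enabled per volatile database with the ctdb tool, and the same tool can show how badly a database is being bounced around. A rough sketch only, using locking.tdb from these logs as the example - verify the commands against ctdb(1) for the installed version:

  # per-database statistics, including the hop count buckets
  ctdb dbstatistics locking.tdb

  # allow read-only delegations of records in this database
  ctdb setdbreadonly locking.tdb

  # make heavily contended ("hot") records stick to a node for a while
  ctdb setdbsticky locking.tdb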
Isaac Stone
2021-Oct-15 18:55 UTC
[Samba] db_ctdb_fetch_locked: folder lock contention of over 4 million ms (over an hour)
Hi Martin

Thanks for your analysis, seems I was wrong before in my diagnosis. When I said "problem starts here" I meant the problem of inaccessible folders, which is the ultimate problem we are trying to solve. The folders remained inaccessible until we did a wipe of all volatile databases and restarted the cluster. Something keeps breaking and then not recovering.

Digging deeper into the logs I found some problems starting earlier. It looks like the initial cause is a snafu by the distributed filesystem we are using. I will get onto questioning them now.

If you want to dig into why ctdb was unable to recover on its own, I have put my analysis below and will attach the log files - beware, they are long and repetitive.

---------------------------------------------------------------------------

First indication is on node 1, at 1am GMT

Oct 14 01:04:13 samba-ape1-1 mount.objectivefs[2384091]: Diskcache reads slower than 247 ms, temporarily reducing disk reads
Oct 14 01:04:13 samba-ape1-1 kernel: ksoftirqd/5: page allocation failure: order:0, mode:0x484020(GFP_ATOMIC|__GFP_COMP), nodemask=(null)
Oct 14 01:04:13 samba-ape1-1 kernel: swapper/1: page allocation failure: order:0, mode:0x484020(GFP_ATOMIC|__GFP_COMP), nodemask=(null)
Oct 14 01:04:13 samba-ape1-1 kernel: ,cpuset=/,mems_allowed=0
Oct 14 01:04:13 samba-ape1-1 kernel: ksoftirqd/4: page allocation failure: order:0, mode:0x484020(GFP_ATOMIC|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
Oct 14 01:04:13 samba-ape1-1 kernel: CPU: 4 PID: 36 Comm: ksoftirqd/4 Kdump: loaded Tainted: G W L --------- - - 4.18.0-240.15.1.el8_3.x86_64 #1
Oct 14 01:04:13 samba-ape1-1 kernel: Hardware name: Amazon EC2 i3en.2xlarge/, BIOS 1.0 10/16/2017
Oct 14 01:04:13 samba-ape1-1 kernel: Call Trace:
Oct 14 01:04:13 samba-ape1-1 kernel: swapper/6: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null),cpuset
Oct 14 01:04:13 samba-ape1-1 kernel: dump_stack+0x5c/0x80
Oct 14 01:04:13 samba-ape1-1 kernel: mount.objective: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)

`mount.objectivefs` is the distributed filesystem we are using. I will raise this bug with them.
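(These "page allocation failure: order:0 ... GFP_ATOMIC" messages are the kernel failing atomic allocations under memory pressure rather than anything Samba/CTDB-specific. One commonly suggested mitigation, offered only as a sketch with a placeholder value that would need sizing for the instance, is to raise the kernel's free-memory reserve so atomic allocations are less likely to fail:

  # check the current reserve
  sysctl vm.min_free_kbytes

  # raise it; persist via /etc/sysctl.d/ if it helps
  sysctl -w vm.min_free_kbytes=262144

Whether this applies here depends on what the filesystem vendor finds.)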
Things run smoothly for about an hour, then we hit another page allocation failure

Oct 14 02:22:42 samba-ape1-1 kernel: warn_alloc: 147 callbacks suppressed
Oct 14 02:22:42 samba-ape1-1 kernel: ksoftirqd/2: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
Oct 14 02:22:42 samba-ape1-1 kernel: ksoftirqd/7: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
Oct 14 02:22:42 samba-ape1-1 kernel: ksoftirqd/0: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
Oct 14 02:22:42 samba-ape1-1 kernel: ksoftirqd/5: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
Oct 14 02:22:42 samba-ape1-1 kernel: ,cpuset=77
Oct 14 02:22:42 samba-ape1-1 kernel: ksoftirqd/4: page allocation failure: order:0, mode:0x484020(GFP_ATOMIC|__GFP_COMP), nodemask=(null)
Oct 14 02:22:42 samba-ape1-1 kernel: /
Oct 14 02:22:42 samba-ape1-1 kernel: ,cpuset
Oct 14 02:22:42 samba-ape1-1 kernel: ,mems_allowed=0
Oct 14 02:22:42 samba-ape1-1 kernel: /
Oct 14 02:22:42 samba-ape1-1 kernel: /,mems_allowed=0
Oct 14 02:22:42 samba-ape1-1 kernel: CPU: 7 PID: 54 Comm: ksoftirqd/7 Kdump: loaded Tainted: G W L --------- - - 4.18.0-240.15.1.el8_3.x86_64 #1
Oct 14 02:22:42 samba-ape1-1 kernel: ,mems_allowed=0
Oct 14 02:22:42 samba-ape1-1 kernel: Hardware name: Amazon EC2 i3en.2xlarge/, BIOS 1.0 10/16/2017

Call trace is different this time.

After a few hundred lines of page allocation failures and callbacks we get these

Oct 14 02:23:34 samba-ape1-1 kernel: ena 0000:00:05.0 eth0: failed to alloc buffer for rx queue 6
Oct 14 02:23:34 samba-ape1-1 kernel: ena 0000:00:05.0 eth0: refilled rx qid 6 with only 106 buffers (from 185)
Oct 14 02:23:34 samba-ape1-1 kernel: ena 0000:00:05.0 eth0: failed to alloc buffer for rx queue 1
Oct 14 02:23:34 samba-ape1-1 kernel: ena 0000:00:05.0 eth0: refilled rx qid 1 with only 55 buffers (from 129)

Then back and forth between eth0 and page allocation failures for a while.

Finally ctdb starts to complain

Oct 14 02:24:38 samba-ape1-1 kernel: ret_from_fork+0x35/0x40
Oct 14 02:24:38 samba-ape1-1 chronyd[6557]: Forward time jump detected!
Oct 14 02:24:38 samba-ape1-1 chronyd[6557]: Can't synchronise: no selectable sources
Oct 14 02:24:38 samba-ape1-1 ctdbd[4128234]: No event for 15 seconds!
Oct 14 02:24:40 samba-ape1-1 ctdbd[4128234]: Tearing down connection to dead node :4
Oct 14 02:24:40 samba-ape1-1 ctdbd[4128234]: 10.13.3.99:4379: node 10.1.3.19:4379 is dead: 3 connected
Oct 14 02:24:40 samba-ape1-1 ctdbd[4128234]: 10.13.3.99:4379: connected to 10.1.3.19:4379 - 4 connected
Oct 14 02:24:42 samba-ape1-1 ctdbd[4128234]: ctdb_listen_event: Incoming queue active, rejecting connection from 10.5.21.119
Oct 14 02:24:42 samba-ape1-1 ctdbd[4128234]: ctdb_listen_event: Incoming queue active, rejecting connection from 10.5.21.119
Oct 14 02:24:42 samba-ape1-1 ctdbd[4128234]: ctdb_listen_event: Incoming queue active, rejecting connection from 10.5.21.119
Oct 14 02:24:43 samba-ape1-1 ctdb-recoverd[4128336]: ../../ctdb/server/ctdb_client.c:680 control timed out. reqid:118226989 opcode:91 dstnode:2
Oct 14 02:24:43 samba-ape1-1 ctdb-recoverd[4128336]: ../../ctdb/server/ctdb_client.c:793 ctdb_control_recv failed
Oct 14 02:24:43 samba-ape1-1 ctdb-recoverd[4128336]: ../../ctdb/server/ctdb_client.c:1047 ctdb_control for getnodes failed ret:-1 res:-1
Oct 14 02:24:43 samba-ape1-1 ctdb-recoverd[4128336]: ../../ctdb/server/ctdb_recoverd.c:2377 Unable to get nodemap from recovery master 2
Oct 14 02:24:44 samba-ape1-1 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 6, index 19.

Meanwhile on node 2 (recovery master at this time):

Oct 14 02:24:42 samba-usw2-2 ctdbd[30352]: dead count reached for node 1
Oct 14 02:24:42 samba-usw2-2 ctdbd[30352]: Tearing down connection to dead node :1
Oct 14 02:24:42 samba-usw2-2 ctdbd[30352]: 10.5.21.119:4379: node 10.13.3.99:4379 is dead: 3 connected
Oct 14 02:24:42 samba-usw2-2 ctdb-recoverd[30454]: ctdb_control error: 'node is disconnected'
Oct 14 02:24:42 samba-usw2-2 ctdb-recoverd[30454]: ctdb_control error: 'node is disconnected'
Oct 14 02:24:42 samba-usw2-2 ctdb-recoverd[30454]: Async operation failed with ret=-1 res=-1 opcode=80
Oct 14 02:24:42 samba-usw2-2 ctdb-recoverd[30454]: Async wait failed - fail_count=1
Oct 14 02:24:42 samba-usw2-2 ctdb-recoverd[30454]: ../../ctdb/server/ctdb_client.c:1622 Failed to read node capabilities.
Oct 14 02:24:42 samba-usw2-2 ctdb-recoverd[30454]: ../../ctdb/server/ctdb_recoverd.c:370 Failed to get node capabilities
Oct 14 02:24:42 samba-usw2-2 ctdb-recoverd[30454]: ../../ctdb/server/ctdb_recoverd.c:2528 Unable to update node capabilities.
Oct 14 02:24:42 samba-usw2-2 ctdbd[30352]: Tearing down connection to dead node :1
Oct 14 02:24:42 samba-usw2-2 ctdbd[30352]: Tearing down connection to dead node :1
Oct 14 02:24:42 samba-usw2-2 ctdbd[30352]: Tearing down connection to dead node :1
Oct 14 02:24:45 samba-usw2-2 ctdbd[30352]: Tearing down connection to dead node :1
Oct 14 02:24:45 samba-usw2-2 ctdbd[30352]: Tearing down connection to dead node :1
Oct 14 02:24:47 samba-usw2-2 ctdbd[30352]: Tearing down connection to dead node :1
Oct 14 02:24:48 samba-usw2-2 ctdbd[30352]: Tearing down connection to dead node :1
Oct 14 02:24:48 samba-usw2-2 ctdbd[30352]: Tearing down connection to dead node :1
Oct 14 02:24:51 samba-usw2-2 ctdbd[30352]: 10.5.21.119:4379: connected to 10.13.3.99:4379 - 4 connected
Oct 14 02:24:52 samba-usw2-2 ctdb-recoverd[30454]: Node 1 has changed flags - now 0x0 was 0x2
Oct 14 02:24:52 samba-usw2-2 ctdb-recoverd[30454]: Remote node 1 had flags 0x0, local had 0x2 - updating local

About this time "high hopcount" starts to appear on nodes 1 and 2

Oct 14 02:24:51 samba-ape1-1 ctdb-recoverd[4128336]: ../../ctdb/server/ctdb_client.c:625 reqid 118226989 not found
Oct 14 02:24:51 samba-ape1-1 ctdbd[4128234]: Tearing down connection to dead node :2
Oct 14 02:24:51 samba-ape1-1 ctdbd[4128234]: 10.13.3.99:4379: node 10.5.21.119:4379 is dead: 3 connected
Oct 14 02:24:51 samba-ape1-1 ctdb-recoverd[4128336]: ctdb_control error: 'node is disconnected'
Oct 14 02:24:51 samba-ape1-1 ctdb-recoverd[4128336]: ctdb_control error: 'node is disconnected'
Oct 14 02:24:51 samba-ape1-1 ctdb-recoverd[4128336]: Async operation failed with ret=-1 res=-1 opcode=80
Oct 14 02:24:51 samba-ape1-1 ctdbd[4128234]: 10.13.3.99:4379: connected to 10.5.21.119:4379 - 4 connected
Oct 14 02:24:52 samba-ape1-1 ctdb-recoverd[4128336]: Node 1 has changed flags - now 0x0 was 0x2
Oct 14 02:24:53 samba-ape1-1 ctdbd[4128234]: dead count reached for node 3
Oct 14 02:24:53 samba-ape1-1 ctdbd[4128234]: Tearing down connection to dead node :3
Oct 14 02:24:53 samba-ape1-1 ctdbd[4128234]: 10.13.3.99:4379: node 10.9.11.95:4379 is dead: 3 connected
Oct 14 02:24:53 samba-ape1-1 ctdb-recoverd[4128336]: ctdb_control error: 'node is disconnected'
Oct 14 02:24:53 samba-ape1-1 ctdb-recoverd[4128336]: ctdb_control error: 'node is disconnected'
Oct 14 02:24:53 samba-ape1-1 ctdb-recoverd[4128336]: Async operation failed with ret=-1 res=-1 opcode=80
Oct 14 02:24:53 samba-ape1-1 ctdb-recoverd[4128336]: Async wait failed - fail_count=2
Oct 14 02:24:53 samba-ape1-1 ctdb-recoverd[4128336]: ../../ctdb/server/ctdb_client.c:1622 Failed to read node capabilities.
Oct 14 02:24:53 samba-ape1-1 ctdb-recoverd[4128336]: ../../ctdb/server/ctdb_recoverd.c:370 Failed to get node capabilities
Oct 14 02:24:53 samba-ape1-1 ctdb-recoverd[4128336]: ../../ctdb/server/ctdb_recoverd.c:2528 Unable to update node capabilities.
Oct 14 02:24:53 samba-ape1-1 ctdbd[4128234]: 10.13.3.99:4379: connected to 10.9.11.95:4379 - 4 connected
Oct 14 02:25:00 samba-ape1-1 ctdbd[4128234]: High hopcount 97 dbid:locking.tdb key:0x92424339 reqid=088ff395 pnn:1 src:1 lmaster:2 header->dmaster:0 dst:2
Oct 14 02:25:00 samba-ape1-1 ctdbd[4128234]: High hopcount 99 dbid:locking.tdb key:0x92424339 reqid=088ff395 pnn:1 src:1 lmaster:2 header->dmaster:0 dst:2

Hopcounts hit 1000 in about a minute, 10000 in 12 minutes, and 100000 in 2 hours. Hopcounts continue to increase until 12:54, which is about the same time node 2 is restarted (boot logs on node 2 start at 12:52).

Oct 14 12:47:56 samba-ape1-1 ctdbd[4128234]: High hopcount 377097 dbid:locking.tdb key:0x807ae9d7 reqid=287ce281 pnn:1 src:0 lmaster:2 header->dmaster:0 dst:2
Oct 14 12:47:56 samba-ape1-1 ctdbd[4128234]: High hopcount 13899 dbid:locking.tdb key:0x807ae9d7 reqid=089a85ec pnn:1 src:1 lmaster:2 header->dmaster:0 dst:2
Oct 14 12:47:56 samba-ape1-1 ctdbd[4128234]: High hopcount 84098 dbid:locking.tdb key:0x92424339 reqid=0b7fffab pnn:1 src:3 lmaster:2 header->dmaster:0 dst:2
Oct 14 12:54:10 samba-ape1-1 ctdbd[4128234]: High hopcount 97 dbid:locking.tdb key:0x92424339 reqid=28b5077a pnn:1 src:0 lmaster:2 header->dmaster:3 dst:2
Oct 14 12:54:10 samba-ape1-1 rsyslogd[3489957]: imjournal: 25996 messages lost due to rate-limiting (20000 allowed within 600 seconds)
Oct 14 12:54:10 samba-ape1-1 ctdbd[4128234]: High hopcount 99 dbid:locking.tdb key:0x92424339 reqid=0de2611b pnn:1 src:4 lmaster:2 header->dmaster:4 dst:2
Oct 14 12:54:10 samba-ape1-1 smbd[2710455]: [2021/10/14 12:54:10.899136, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
Oct 14 12:54:10 samba-ape1-1 smbd[2710455]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.1 key 7A70496E4466F9ADBE3DF538430000000000000000000000, chain 91584 needed 1 attempts, 31800705 milliseconds, chainlock: 0.027000 ms, CTDB 31800705.
Oct 14 12:54:13 samba-ape1-1 ctdbd[4128234]: High hopcount 397 dbid:locking.tdb key:0x92424339 reqid=0de25a99 pnn:1 src:4 lmaster:2 header->dmaster:3 dst:2
Oct 14 12:54:13 samba-ape1-1 ctdbd[4128234]: High hopcount 399 dbid:locking.tdb key:0x92424339 reqid=0de25b5d pnn:1 src:4 lmaster:2 header->dmaster:3 dst:2
Oct 14 12:54:13 samba-ape1-1 ctdbd[4128234]: High hopcount 399 dbid:locking.tdb key:0x92424339 reqid=0de25abb pnn:1 src:4 lmaster:2 header->dmaster:3 dst:2

Both hopcount and db_ctdb_fetch_locked errors are in the logs until 12:55, after which only db_ctdb_fetch_locked errors appear

Oct 14 12:54:58 samba-ape1-1 ctdbd[4128234]: High hopcount 699 dbid:locking.tdb key:0x92424339 reqid=089adfea pnn:1 src:1 lmaster:2 header->dmaster:3 dst:2
Oct 14 12:54:58 samba-ape1-1 ctdbd[4128234]: High hopcount 697 dbid:locking.tdb key:0x92424339 reqid=089adfa7 pnn:1 src:1 lmaster:2 header->dmaster:3 dst:2
Oct 14 12:54:58 samba-ape1-1 ctdbd[4128234]: High hopcount 697 dbid:locking.tdb key:0x92424339 reqid=089adfa0 pnn:1 src:1 lmaster:2 header->dmaster:3 dst:2
Oct 14 12:55:00 samba-ape1-1 smbd[2824067]: [2021/10/14 12:55:00.854317, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
Oct 14 12:55:00 samba-ape1-1 smbd[2824067]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.1 key 7A70496E4466F9ADBE3DF538430000000000000000000000, chain 91584 needed 1 attempts, 15885080 milliseconds, chainlock: 0.074000 ms, CTDB 15885079.
Oct 14 12:55:02 samba-ape1-1 smbd[2684088]: [2021/10/14 12:55:02.763524, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
Oct 14 12:55:02 samba-ape1-1 smbd[2684088]: db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.1 key 7A70496E4466F9ADBE3DF538430000000000000000000000, chain 91584 needed 1 attempts, 34119752 milliseconds, chainlock: 0.038000 ms, CTDB 34119751.
Oct 14 12:55:03 samba-ape1-1 smbd[2838843]: [2021/10/14 12:55:03.073739, 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)

Errors of the exact same key continue for a couple of hours until the volatile db directory is wiped and ctdb is restarted.
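For completeness, the "wipe the volatile databases and restart" step referred to above looks roughly like the following on a systemd-based install. This is only a sketch: the volatile path is the one shown in the log messages above, the service name may differ per packaging, and stopping the whole cluster this way is obviously disruptive:

  # run on every node via ctdb's onnode helper
  onnode -p all systemctl stop ctdb

  # remove the per-node copies of the volatile TDBs (persistent TDBs are kept)
  onnode -p all 'rm -f /var/cache/dbdir/volatile/*.tdb.*'

  onnode -p all systemctl start ctdb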