Isaac Stone
2021-Oct-08 17:27 UTC
[Samba] db_ctdb_fetch_locked: folder lock contention of over 4 million ms (over an hour)
Had the same problem again yesterday, only different.

- same problem in that there was a single folder no one could access.
- different in that the errors logged were different.

Instead of db_ctdb_fetch_locked errors in the smbd logs, we only had high
hopcount errors in the ctdbd logs, and only on two of the nodes.

The hopcounts were also incredibly high and numerous, logged upwards of 60
times a second with hopcounts in excess of 500 thousand. The errors were
all for the exact same key.

On node 2:

  High hopcount 430199 dbid:locking.tdb key:0x5930e2c3 reqid=003ef9ef pnn:3
  src:0 lmaster:4 header->dmaster:2 dst:4

On node 3:

  High hopcount 350598 dbid:locking.tdb key:0x5930e2c3 reqid=0e34f704 pnn:4
  src:4 lmaster:4 header->dmaster:3 dst:3

We fixed the error the same way as last time - bring down all nodes, wipe
the /volatile/ directory, and bring the cluster back up.

Questions:

Is there a way to correlate the key in the "high hopcount" logs to a
specific file lock? Something like `net tdb locking`?

Is there a way to diagnose and fix these problems without bringing the
whole cluster down and wiping all locks everywhere? It is effective but
also causes systemwide service interruptions.

I noticed the `setdbsticky` option in the man page. It says it is
experimental, but also indicates it may solve this problem. Is there any
more information on this option? I am hesitant to recommend experimental
settings for prod.

I am leaning towards changing fileid:algorithm to fsname_nodirs as a
solution to this recurring problem, but I want to understand the problem
better before making that change.

Any advice is greatly appreciated.

- Isaac Stone

On Thu, Oct 7, 2021 at 9:01 AM Isaac Stone <isaac.stone at som.com> wrote:

> Hey Martin, thanks for the information.
>
> We did notice there were duplicates, such as `locking.tdb.2` being on
> nodes 2 and 3. This also happened during the EBS outage the week before.
> Could this be a symptom of the high hopcount problem? Records being moved
> between nodes because one node had an outage or similar? Would there be a
> way to recover without restarting ctdb?
>
> The problem with the stop, delete, start method we have been using is
> that it affects all users of the system, whereas the issues are usually
> isolated to a single folder.
>
> Cheers,
> - isaac
>
> On Thu, Oct 7, 2021 at 2:51 AM Martin Schwenke <martin at meltin.net> wrote:
>
>> Hi Isaac,
>>
>> On Wed, 6 Oct 2021 11:57:07 -0700, Isaac Stone via samba
>> <samba at lists.samba.org> wrote:
>>
>> > We have been noticing slowness and outages in samba.
>> >
>> > First time was the aws EBS outage of 9/27. We are hosting our ctdb
>> > `tdb` files on aws nvme, and the outage caused issues that did not
>> > seem to get recovered. The smbd logs had lots of this kind of message
>> >
>> > db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.2 key
>> > AECED2BC2A66F9AD1C55142C290000000000000000000000, chain 62588 needed 1
>> > attempts, 21471 milliseconds, chainlock: 7257.000000 ms, CTDB
>> > 14214.887000 ms
>> >
>> > We have a samba+ctdb setup with multiple globally distributed nodes.
>> >
>> > The issue persisted well after the EBS outage, but seemed restricted
>> > to accessing a single folder. Anyone who tried to access that folder
>> > would have their machine hard lock and be forced to reboot.
>> >
>> > To fix and restore functionality to our users we stopped ctdb on all
>> > nodes, wiped the `volatile` directory and then started ctdb again.
>> >
>> > This fixed the issue, but then yesterday we had the same thing again.
>> > This time it was a different folder, and the wait times were much
>> > higher, past 4,500,000 milliseconds.
>> >
>> > I attempted to run `ctdb wipedb locking.tdb` but that did not seem to
>> > have any effect. In the end we had to stop, delete volatile, and
>> > restart again.
>> >
>> > questions:
>> >
>> > What could be the root cause? We see occasional db_ctdb_fetch_locked
>> > logs but usually not more than 10 seconds. But this is the second big
>> > spike and I want to know what caused it and how to avoid it. I am
>> > looking at possibly changing fileid:algorithm from fsname to
>> > fsname_nodirs, any potential problems with this?
>>
>> When I have seen this it has been due to an smbd holding a lock (in
>> locking.tdb) and getting stuck in a system call to the cluster
>> filesystem. This is usually due to a cluster filesystem bug or an
>> underlying storage problem (an EBS outage might do that).
>>
>> You can diagnose this by hand by seeing if one of your nodes has an
>> smbd in D state and then check /proc/<pid>/stack for the relevant pid.
>>
>> See also the "lock debug script" option in the [database] section of
>> the ctdb.conf(5) manual page. However, if your installation is using
>> robust mutexes for locking records in TDBs (the modern default) then
>> you would need the changes that went into 4.15 to be able to use that.
>>
>> > During around the same time as the locking logs in smbd we see ctdbd
>> > logs like this
>> >
>> > High hopcount 198 dbid:locking.tdb
>> >
>> > Could this be related?
>>
>> High hopcount is described here:
>>
>> https://wiki.samba.org/index.php/CTDB_database_design#Performance_problems_due_to_contention
>>
>> It is possible that they're related if there is a high enough latency
>> when fetching records between nodes. It might not be
>> smbd-stuck-in-cluster-filesystem... but 4,500 / 200 = 22.5, so you'd
>> need a 22 second latency to get that number... and it seems unlikely.
>>
>> I hope some of that is helpful...
>>
>> peace & happiness,
>> martin
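For reference, the by-hand check Martin describes above (an smbd stuck in
D state holding a locking.tdb record) can be scripted roughly like this on
each node. This is only a sketch, not a command from the thread: it assumes
Linux procps `ps`, and reading /proc/<pid>/stack normally needs root and a
kernel with stack tracing enabled.

    # list smbd processes in uninterruptible sleep and show where they are stuck
    ps -eo pid=,state=,comm= | awk '$2 == "D" && $3 == "smbd" { print $1 }' |
    while read -r pid; do
        echo "=== smbd pid $pid in D state ==="
        cat "/proc/$pid/stack"   # kernel stack; a stuck cluster-fs call shows up here
    done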
Martin Schwenke
2021-Oct-11 11:17 UTC
[Samba] db_ctdb_fetch_locked: folder lock contention of over 4 million ms (over an hour)
On Fri, 8 Oct 2021 10:27:09 -0700, Isaac Stone <isaac.stone at som.com> wrote:

> Had the same problem again yesterday, only different.
>
> - same problem in that there was a single folder no one could access.
> - different in that the errors logged were different.
>
> Instead of db_ctdb_fetch_locked errors in the smbd logs, we only had high
> hopcount errors in the ctdbd logs, and only on two of the nodes.
>
> The hopcounts were also incredibly high and numerous, logged upwards of
> 60 times a second with hopcounts in excess of 500 thousand. The errors
> were all for the exact same key.
>
> On node 2:
>
>   High hopcount 430199 dbid:locking.tdb key:0x5930e2c3 reqid=003ef9ef
>   pnn:3 src:0 lmaster:4 header->dmaster:2 dst:4
>
> On node 3:
>
>   High hopcount 350598 dbid:locking.tdb key:0x5930e2c3 reqid=0e34f704
>   pnn:4 src:4 lmaster:4 header->dmaster:3 dst:3
>
> We fixed the error the same way as last time - bring down all nodes, wipe
> the /volatile/ directory, and bring the cluster back up.

That's bad. That indicates that something is getting corrupted somewhere.
Is there any indication that you're running out of disk space or memory?

Quite a few years ago there were some races in vacuuming (garbage
collection) where this could happen, but they were fixed. Last time we saw
this sort of thing we audited the code and couldn't find a (relevant) bug.
For the situation we were looking at, I'm quite sure that we tracked it
down to running out of something (disk? memory?) and being unable to
recover.

If this is happening repeatedly there should be a common clue in the logs,
around the time it starts.

> Questions:
>
> Is there a way to correlate the key in the "high hopcount" logs to a
> specific file lock? Something like `net tdb locking`?

The key in the high hopcount logs is a hash of the real key, so the mapping
only goes one way. We really should print the hex key. If you can guess the
key then you can generate the hash (using
lib/tdb/common/hash.c:tdb_jenkins_hash()) and confirm.

> Is there a way to diagnose and fix these problems without bringing the
> whole cluster down and wiping all locks everywhere? It is effective but
> also causes systemwide service interruptions.

With slightly improved logging that might be possible. However, there is a
catch-22. If you shut down all of the nodes then you disconnect the clients
and you don't risk filesystem data corruption. If you figure out the key,
and figure out how to remove it from the databases on all nodes
(undocumented "ctdb tstore" with empty data *might* do what you want - I'd
have to read code to confirm, but need to sleep instead), then you'll be
alive but you risk filesystem data corruption.

The simplified reasoning here is that you may have a client that thinks it
has a lock on a file or directory (and is modifying it), you'll effectively
release the lock by deleting the record, and then other clients will also
get access (and also modify it).

> I noticed the `setdbsticky` option in the man page. It says it is
> experimental, but also indicates it may solve this problem. Is there any
> more information on this option? I am hesitant to recommend experimental
> settings for prod.

Hop counts that high indicate some sort of corruption. It seems very
unlikely that this is just a performance problem.

> I am leaning towards changing fileid:algorithm to fsname_nodirs as a
> solution to this recurring problem, but I want to understand the problem
> better before making that change.
>
> Any advice is greatly appreciated.

I'm trying to get an overview of CTDB performance issues here:

https://wiki.samba.org/index.php/CTDB_Performance

Nobody has fully explained the implications of that option there. You have
to understand that it breaks lock coherency (i.e. effectively no locking on
directories) and the implications of that.

Sorry I can't be more help right now... busy day, need sleep...

peace & happiness,
martin
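For reference, the change being weighed would look roughly like the
following in smb.conf. This is only a sketch, not a recommendation from the
thread: it assumes the fileid VFS module is (or will be) loaded on the
clustered shares, and the exact section placement depends on the existing
configuration.

    [global]
        vfs objects = fileid
        # was: fileid:algorithm = fsname
        # fsname_nodirs stops handing out unique file IDs for directories,
        # which avoids cross-node contention on directory records but also
        # means directory locking is effectively disabled across nodes
        # (the lock-coherency caveat Martin describes above)
        fileid:algorithm = fsname_nodirs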