Isaac Stone
2021-Oct-06 18:57 UTC
[Samba] db_ctdb_fetch_locked: folder lock contention of over 4 million ms (over an hour)
Hello,

We have been noticing slowness and outages in samba.

The first time was the AWS EBS outage of 9/27. We are hosting our ctdb `tdb` files on AWS NVMe, and the outage caused issues that did not seem to get recovered. The smbd logs had lots of this kind of message:

    db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.2 key
    AECED2BC2A66F9AD1C55142C290000000000000000000000, chain 62588 needed 1
    attempts, 21471 milliseconds, chainlock: 7257.000000 ms, CTDB 14214.887000 ms

We have a samba+ctdb setup with multiple globally distributed nodes.

The issue persisted well after the EBS outage, but seemed restricted to accessing a single folder. Anyone who tried to access that folder would have their machine hard lock and be forced to reboot.

To fix and restore functionality to our users we stopped ctdb on all nodes, wiped the `volatile` directory and then started ctdb again.

This fixed the issue, but then yesterday we had the same thing again. This time it was a different folder, and the wait times were much higher, past 4,500,000 milliseconds.

I attempted to run `ctdb wipedb locking.tdb` but that did not seem to have any effect. In the end we had to stop, delete volatile, and restart again.

Questions:

What could be the root cause? We see occasional db_ctdb_fetch_locked logs, but usually not more than 10 seconds. This is the second big spike and I want to know what caused it and how to avoid it. I am looking at possibly changing fileid:algorithm from fsname to fsname_nodirs; are there any potential problems with this?

During around the same time as the locking logs in smbd we see ctdbd logs like this:

    High hopcount 198 dbid:locking.tdb

Could be related?

---------

ctdb version 4.13.7-SerNet-RedHat-11.el8
Samba version 4.13.7-SerNet-RedHat-11.el8

smb.conf:

[global]
        netbios name = WBD-S02-USE1
        realm = GLOBAL.DOMAIN.COM
        workgroup = GLOBAL
        security = ads
        clustering = yes
        fake oplocks = no
        log level = 2
        idmap config * : backend = tdb
        idmap config * : range = 3000-7999
        idmap config GLOBAL : backend = rid
        idmap config GLOBAL : range = 10000-999999
        dedicated keytab file = /etc/krb5.keytab
        kerberos method = secrets and keytab
        map acl inherit = yes
        winbind refresh tickets = yes
        # disables printing:
        load printers = no
        printing = bsd
        printcap name = /dev/null
        disable spoolss = yes
        bind interfaces only = yes
        interfaces = lo eth0
        vfs objects = fileid fruit streams_xattr acl_xattr shadow_copy2
        acl_xattr:ignore system acls = yes
        fruit:nfs_aces = no
        fruit:metadata = stream

#================ Share Definitions =================
[DC]
        path = "/mnt/use1/DC"
        writeable = yes
        browsable = yes
        fileid:algorithm = fsname
        shadow:mountpoint = /mnt/use1
        shadow:snapdir = /snapshots/use1
        shadow:snapsharepath = DC
        shadow:format = %Y-%m-%dT%H:%M:%SZ
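For reference, a minimal sketch of the recovery procedure described above (stop ctdb on every node, wipe only the volatile databases, start ctdb again), assuming the volatile dbdir from the log path above and a systemd unit named ctdb; both the unit name and the path are assumptions and may differ on a SerNet install:

    # onnode ships with ctdb and runs a command on every node in the nodes file
    onnode -p all 'systemctl stop ctdb'                    # assumed unit name
    onnode -p all 'rm -rf /var/cache/dbdir/volatile/*'     # volatile TDBs only, never the persistent dir
    onnode -p all 'systemctl start ctdb'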
Martin Schwenke
2021-Oct-07 09:44 UTC
[Samba] db_ctdb_fetch_locked: folder lock contention of over 4 million ms (over an hour)
Hi Isaac,

On Wed, 6 Oct 2021 11:57:07 -0700, Isaac Stone via samba
<samba at lists.samba.org> wrote:

> We have been noticing slowness and outages in samba.
>
> The first time was the AWS EBS outage of 9/27. We are hosting our ctdb `tdb`
> files on AWS NVMe, and the outage caused issues that did not seem to get
> recovered. The smbd logs had lots of this kind of message:
>
>     db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.2 key
>     AECED2BC2A66F9AD1C55142C290000000000000000000000, chain 62588 needed 1
>     attempts, 21471 milliseconds, chainlock: 7257.000000 ms, CTDB 14214.887000 ms
>
> We have a samba+ctdb setup with multiple globally distributed nodes.
>
> The issue persisted well after the EBS outage, but seemed restricted to
> accessing a single folder. Anyone who tried to access that folder would
> have their machine hard lock and be forced to reboot.
>
> To fix and restore functionality to our users we stopped ctdb on all nodes,
> wiped the `volatile` directory and then started ctdb again.
>
> This fixed the issue, but then yesterday we had the same thing again. This
> time it was a different folder, and the wait times were much higher, past
> 4,500,000 milliseconds.
>
> I attempted to run `ctdb wipedb locking.tdb` but that did not seem to have
> any effect. In the end we had to stop, delete volatile, and restart again.
>
> Questions:
>
> What could be the root cause? We see occasional db_ctdb_fetch_locked logs,
> but usually not more than 10 seconds. This is the second big spike and
> I want to know what caused it and how to avoid it. I am looking at possibly
> changing fileid:algorithm from fsname to fsname_nodirs; are there any
> potential problems with this?

When I have seen this it has been due to an smbd holding a lock (in
locking.tdb) and getting stuck in a system call to the cluster
filesystem. This is usually due to a cluster filesystem bug or an
underlying storage problem (an EBS outage might do that).

You can diagnose this by hand by seeing if one of your nodes has an
smbd in D state and then checking /proc/<pid>/stack for the relevant pid.

See also the "lock debug script" option in the [database] section of
the ctdb.conf(5) manual page. However, if your installation is using
robust mutexes for locking records in TDBs (the modern default) then
you would need the changes that went into 4.15 to be able to use that.

> During around the same time as the locking logs in smbd we see ctdbd logs
> like this:
>
>     High hopcount 198 dbid:locking.tdb
>
> Could be related?

High hopcount is described here:

  https://wiki.samba.org/index.php/CTDB_database_design#Performance_problems_due_to_contention

It is possible that they're related if there is a high enough latency
when fetching records between nodes. It might not be
smbd-stuck-in-cluster-filesystem... but 4,500,000 ms is 4,500 seconds,
and 4,500 / 200 = 22.5, so you'd need a 22 second per-hop latency to
get that number... and it seems unlikely.

I hope some of that is helpful...

peace & happiness,
martin
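A minimal sketch of the by-hand check described above (find an smbd stuck in uninterruptible sleep and inspect its kernel stack), to be run on each node; reading /proc/<pid>/stack generally requires root, and the awk field positions below assume this exact ps field list:

    # list smbd processes currently in D (uninterruptible sleep) state,
    # along with the kernel function they are waiting in
    ps -eo pid,stat,wchan:32,comm | awk '$4 == "smbd" && $2 ~ /^D/'

    # for each PID printed above, see where in the kernel it is stuck
    cat /proc/<pid>/stack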