Hi All,

I have a newly implemented two-node CTDB cluster running on CentOS 7, Samba 4.7.1.

The node network is a direct 1Gb link.

Storage is CephFS.

ctdb status is OK.

It seems to be running well so far, but I'm frequently seeing the following in my log.smbd:

  [2018/09/18 19:16:15.897742,  0] ../source3/lib/dbwrap/dbwrap_ctdb.c:1207(fetch_locked_internal)
    db_ctdb_fetch_locked for /var/lib/ctdb/locking.tdb.1 key DE0726567AF1EAFD4A741403000100000000000000000000, chain 3642 needed 16 attempts, 315 milliseconds, chainlock: 78.340000 ms, CTDB 236.511000 ms
  [2018/09/18 19:16:15.958368,  0] ../source3/lib/dbwrap/dbwrap_ctdb.c:1207(fetch_locked_internal)
    db_ctdb_fetch_locked for /var/lib/ctdb/locking.tdb.1 key DE0726567AF1EAFD4A741403000100000000000000000000, chain 3642 needed 15 attempts, 297 milliseconds, chainlock: 58.532000 ms, CTDB 239.124000 ms
  [2018/09/18 19:16:18.139443,  0] ../source3/lib/dbwrap/dbwrap_ctdb.c:1207(fetch_locked_internal)
    db_ctdb_fetch_locked for /var/lib/ctdb/locking.tdb.1 key DE0726567AF1EAFD4A741403000100000000000000000000, chain 3642 needed 11 attempts, 128 milliseconds, chainlock: 27.141000 ms, CTDB 101.450000 ms

Can someone advise what this means and whether it's something to be concerned about?

I've worked with CTDB set-ups in the past and don't recall seeing these messages, but that could just be because verbosity is higher in this version.

It's a new implementation; I've not had any performance or locking issues reported, and smbstatus -L looks normal, although I have noticed that the load average on the servers is a little higher than expected.

Here is my sanitized smb.conf and ctdbd.conf:

[global]
        workgroup = DOMAIN
        realm = DOMAIN.LOCAL
        security = ADS
        clustering = yes
        idmap config * : range = 16777216-33554431
        template shell = /bin/false
        kerberos method = secrets only
        winbind use default domain = True
        netbios name = HOSTNAME
        log level = 1
        create krb5 conf = yes
        encrypt passwords = yes
        unix extensions = No
        min protocol = smb2
        max protocol = smb2
        strict allocate = yes
        follow symlinks = yes
        allow insecure wide links = yes
        idmap config DOMAIN : backend = ad
        idmap config DOMAIN : range = 700-199999
        idmap config DOMAIN : schema_mode = rfc2307
        idmap config DOMAIN : unix_nss_info = yes
        idmap config DOMAIN : unix_primary_group = yes
        winbind enum users = yes
        winbind enum groups = yes
        winbind normalize names = no
        winbind reconnect delay = 2
        winbind cache time = 900
        name resolve order = host
        disable netbios = yes
        fileid:algorithm = fsid
        vfs objects = fileid
        usershare allow guests = yes
        map to guest = Bad User
        hide dot files = Yes
        hide files = /$*/
        hide special files = yes
        strict sync = No

/etc/ctdb/ctdbd.conf:

        CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
        CTDB_MANAGES_SAMBA=yes
        CTDB_SAMBA_SKIP_SHARE_CHECK=yes
        CTDB_MANAGES_WINBIND=yes
        CTDB_MANAGES_NFS=yes
        CTDB_NFS_CALLOUT=/etc/ctdb/nfs-ganesha-callout
        CTDB_NFS_CHECKS_DIR=/etc/ctdb/nfs-checks-ganesha.d
        CTDB_NFS_SKIP_SHARE_CHECK=yes
        CTDB_DEBUGLEVEL=NOTICE
How did you mount your cephfs filesystem?

On 18 September 2018 20:34:25 CEST, David C via samba <samba at lists.samba.org> wrote:

> Hi All
>
> I have a newly implemented two-node CTDB cluster running on CentOS 7,
> Samba 4.7.1
>
> [...]
>
> Can someone advise what this means and whether it's something to be
> concerned about?

--
This message was sent from my Android device with K-9 Mail.
Hi Micha,

With the cephfs kernel client. That prompted me to check the mount options, and they actually differ on each node:

Node 1: rw,noatime,name=admin,secret=<hidden>,acl,wsize=16777216,rasize=268439552,_netdev

Node 2: rw,relatime,name=admin,secret=<hidden>,acl,wsize=16777216

I need to fix that, although I don't think it's related to this particular issue.

A number of directories in the filesystem are exported like this:

[username]
        comment = username home folder
        path = /cephfs/dir
        read only = no
        guest ok = no
        valid users = user
        wide links = Yes
        aio read size = 1
        aio write size = 1
        vfs objects = aio_pthread
        dfree command = /usr/local/bin/dfree.sh
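To stop the options drifting apart again, I'll probably pin them in /etc/fstab on both nodes, roughly like this. The monitor addresses and secretfile path below are placeholders rather than my real values:

        # CephFS kernel client mount, kept identical on both nodes
        mon1,mon2,mon3:/  /cephfs  ceph  name=admin,secretfile=/etc/ceph/admin.secret,acl,noatime,wsize=16777216,rasize=268439552,_netdev  0  0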
Thanks,

On Tue, Sep 18, 2018 at 8:15 PM Micha Ballmann <ballmann at uni-landau.de> wrote:

> How did you mount your cephfs filesystem?
> [...]
Hi David,

On Tue, 18 Sep 2018 19:34:25 +0100, David C via samba <samba at lists.samba.org> wrote:

> It seems to be running well so far, but I'm frequently seeing the
> following in my log.smbd:
>
>   [2018/09/18 19:16:15.897742,  0] ../source3/lib/dbwrap/dbwrap_ctdb.c:1207(fetch_locked_internal)
>     db_ctdb_fetch_locked for /var/lib/ctdb/locking.tdb.1 key
>     DE0726567AF1EAFD4A741403000100000000000000000000, chain 3642 needed 16
>     attempts, 315 milliseconds, chainlock: 78.340000 ms, CTDB 236.511000 ms
>   [...]
>
> Can someone advise what this means and whether it's something to be
> concerned about?

As SMB clients perform operations on files, ctdbd's main role is to migrate metadata about those files, such as locking/share-mode info, between nodes of the cluster.

The above messages are telling you that ctdbd took more than a pre-defined threshold to migrate a record. This probably means that there is contention between nodes for the file or directory represented by the given key. If this is the case then I would expect to see similar messages in the log on each node. If the numbers get much higher then I would expect to see a performance impact.

Is it always the same key? A small group of keys? That is likely to mean contention. If migrations for many different keys are taking longer than the threshold then ctdbd might just be overloaded.

You may be able to use the "net tdb locking" command to find out more about the key in question. You'll need to run the command while clients are accessing the file represented by the key. If the file is being accessed constantly and heavily then that shouldn't be a problem. ;-)

If the contention is for the root directory of a share, and you don't actually need lock coherency there, then you could think about using the

        fileid:algorithm = fsname_norootdir

option. However, I note you're using "fileid:algorithm = fsid". If that is needed for CephFS then the fsname_norootdir option might not be appropriate.

You could also consider using the fileid:nolockinode hack if it is appropriate.

You should definitely read vfs_fileid(8) before using either of these options.

Although clustering has obvious benefits, it doesn't come for free. Dealing with contention can be tricky... :-)

peace & happiness,
martin
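P.S. In case it helps, here is roughly what those two knobs look like in smb.conf. This is only a sketch: the nolockinode value is a placeholder inode number, and you should confirm the exact semantics against vfs_fileid(8) for your version before using either option:

        vfs objects = fileid
        fileid:algorithm = fsname_norootdir
        ; or, to disable lock coherency for one specific inode (123456 is a placeholder):
        ; fileid:nolockinode = 123456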
Hi Martin,

Many thanks for the detailed response. A few follow-ups inline:

On Wed, Sep 19, 2018 at 5:19 AM Martin Schwenke <martin at meltin.net> wrote:

> Is it always the same key? A small group of keys? That is likely
> to mean contention. If migrations for many different keys are taking
> longer than the threshold then ctdbd might just be overloaded.

Confirmed, it's always the same key, which I suppose is good news?

> You may be able to use the "net tdb locking" command to find out more
> about the key in question. You'll need to run the command while
> clients are accessing the file represented by the key.

Currently reporting: "Record with key DE0726567AF1EAFD4A741403000100000000000000000000 not found."

So I guess clients aren't currently accessing it. The messages are fairly frequent, though, so I should be able to catch it; I may just run that command on a loop until it does.
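Something like this, I think. Just a rough sketch: I'm matching on the "not found" text in the output rather than relying on the exit status, since I'm not sure what the command returns when the record is missing:

        #!/bin/sh
        # Poll "net tdb locking" until the contended record appears in locking.tdb
        KEY=DE0726567AF1EAFD4A741403000100000000000000000000
        while :; do
                out=$(net tdb locking "$KEY" 2>&1)
                case "$out" in
                *"not found"*) sleep 1 ;;               # record not present yet, keep polling
                *) printf '%s\n' "$out"; break ;;       # caught it: print the lock info and stop
                esac
        done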
Is there any other way of translating the key to the inode?

> If the contention is for the root directory of a share, and you don't
> actually need lock coherency there, then you could think about using the
>
>         fileid:algorithm = fsname_norootdir
>
> option. However, I note you're using "fileid:algorithm = fsid". If
> that is needed for CephFS then the fsname_norootdir option might not be
> appropriate.

This was a leftover from a short-lived experiment with OCFS2, where I think it was required. I think CephFS should be fine with fsname.

> You could also consider using the fileid:nolockinode hack if it is
> appropriate.
>
> You should definitely read vfs_fileid(8) before using either of these
> options.

I'll have a read. Thanks again for your assistance.