Hi All,

I have a newly implemented two-node CTDB cluster running on CentOS 7, Samba 4.7.1.

The node network is a direct 1Gb link.

Storage is CephFS.

ctdb status is OK.

It seems to be running well so far, but I'm frequently seeing the following in my log.smbd:

  [2018/09/18 19:16:15.897742,  0] ../source3/lib/dbwrap/dbwrap_ctdb.c:1207(fetch_locked_internal)
    db_ctdb_fetch_locked for /var/lib/ctdb/locking.tdb.1 key DE0726567AF1EAFD4A741403000100000000000000000000, chain 3642 needed 16 attempts, 315 milliseconds, chainlock: 78.340000 ms, CTDB 236.511000 ms
  [2018/09/18 19:16:15.958368,  0] ../source3/lib/dbwrap/dbwrap_ctdb.c:1207(fetch_locked_internal)
    db_ctdb_fetch_locked for /var/lib/ctdb/locking.tdb.1 key DE0726567AF1EAFD4A741403000100000000000000000000, chain 3642 needed 15 attempts, 297 milliseconds, chainlock: 58.532000 ms, CTDB 239.124000 ms
  [2018/09/18 19:16:18.139443,  0] ../source3/lib/dbwrap/dbwrap_ctdb.c:1207(fetch_locked_internal)
    db_ctdb_fetch_locked for /var/lib/ctdb/locking.tdb.1 key DE0726567AF1EAFD4A741403000100000000000000000000, chain 3642 needed 11 attempts, 128 milliseconds, chainlock: 27.141000 ms, CTDB 101.450000 ms

Can someone advise what this means and whether it's something to be concerned about?

I've worked with CTDB set-ups in the past and don't recall seeing these messages, but that could just be because verbosity is higher in this version.

It's a new implementation; I've not had any performance or locking issues reported, and smbstatus -L looks normal, although I have noticed that the load average on the servers is a little higher than expected.

Here is my sanitized smb.conf and ctdbd.conf:

[global]
        workgroup = DOMAIN
        realm = DOMAIN.LOCAL
        security = ADS
        clustering = yes
        idmap config * : range = 16777216-33554431
        template shell = /bin/false
        kerberos method = secrets only
        winbind use default domain = True
        netbios name = HOSTNAME
        log level = 1
        create krb5 conf = yes
        encrypt passwords = yes
        unix extensions = No
        min protocol = smb2
        max protocol = smb2
        strict allocate = yes
        follow symlinks = yes
        allow insecure wide links = yes
        idmap config DOMAIN : backend = ad
        idmap config DOMAIN : range = 700-199999
        idmap config DOMAIN : schema_mode = rfc2307
        idmap config DOMAIN : unix_nss_info = yes
        idmap config DOMAIN : unix_primary_group = yes
        winbind enum users = yes
        winbind enum groups = yes
        winbind normalize names = no
        winbind reconnect delay = 2
        winbind cache time = 900
        name resolve order = host
        disable netbios = yes
        fileid:algorithm = fsid
        vfs objects = fileid
        usershare allow guests = yes
        map to guest = Bad User
        hide dot files = Yes
        hide files = /$*/
        hide special files = yes
        strict sync = No

/etc/ctdb/ctdbd.conf:

        CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
        CTDB_MANAGES_SAMBA=yes
        CTDB_SAMBA_SKIP_SHARE_CHECK=yes
        CTDB_MANAGES_WINBIND=yes
        CTDB_MANAGES_NFS=yes
        CTDB_NFS_CALLOUT=/etc/ctdb/nfs-ganesha-callout
        CTDB_NFS_CHECKS_DIR=/etc/ctdb/nfs-checks-ganesha.d
        CTDB_NFS_SKIP_SHARE_CHECK=yes
        CTDB_DEBUGLEVEL=NOTICE
How did you mount your cephfs filesystem?

On 18 September 2018 20:34:25 CEST, David C via samba <samba at lists.samba.org> wrote:

> Hi All
>
> I have a newly implemented two-node CTDB cluster running on CentOS 7,
> Samba 4.7.1
>
> [...]
>
> Can someone advise what this means and whether it's something to be
> concerned about?

--
This message was sent from my Android device with K-9 Mail.
Hi Micha,

With the cephfs kernel client. That prompted me to check the mount options, and they actually differ on each node:

Node 1: rw,noatime,name=admin,secret=<hidden>,acl,wsize=16777216,rasize=268439552,_netdev

Node 2: rw,relatime,name=admin,secret=<hidden>,acl,wsize=16777216

I need to fix that, although I don't think it's related to this particular issue.

A number of directories in the filesystem are exported like this:

[username]
        comment = username home folder
        path = /cephfs/dir
        read only = no
        guest ok = no
        valid users = user
        wide links = Yes
        aio read size = 1
        aio write size = 1
        vfs objects = aio_pthread
        dfree command = /usr/local/bin/dfree.sh
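To stop the options drifting apart again, I'll probably pin them in /etc/fstab on both nodes, roughly like this. The monitor addresses and secretfile path below are placeholders rather than my real values:

        # CephFS kernel client mount, kept identical on both nodes
        mon1,mon2,mon3:/  /cephfs  ceph  name=admin,secretfile=/etc/ceph/admin.secret,acl,noatime,wsize=16777216,rasize=268439552,_netdev  0  0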
Thanks,

On Tue, Sep 18, 2018 at 8:15 PM Micha Ballmann <ballmann at uni-landau.de> wrote:

> How did you mount your cephfs filesystem?
> [...]
Hi David,

On Tue, 18 Sep 2018 19:34:25 +0100, David C via samba <samba at lists.samba.org> wrote:

> It seems to be running well so far, but I'm frequently seeing the
> following in my log.smbd:
>
>   [2018/09/18 19:16:15.897742,  0] ../source3/lib/dbwrap/dbwrap_ctdb.c:1207(fetch_locked_internal)
>     db_ctdb_fetch_locked for /var/lib/ctdb/locking.tdb.1 key
>     DE0726567AF1EAFD4A741403000100000000000000000000, chain 3642 needed 16
>     attempts, 315 milliseconds, chainlock: 78.340000 ms, CTDB 236.511000 ms
>   [...]
>
> Can someone advise what this means and whether it's something to be
> concerned about?

As SMB clients perform operations on files, ctdbd's main role is to migrate metadata about those files, such as locking/share-mode info, between nodes of the cluster.

The above messages are telling you that ctdbd took more than a pre-defined threshold to migrate a record. This probably means that there is contention between nodes for the file or directory represented by the given key. If this is the case then I would expect to see similar messages in the log on each node. If the numbers get much higher then I would expect to see a performance impact.

Is it always the same key? A small group of keys? That is likely to mean contention. If migrations for many different keys are taking longer than the threshold then ctdbd might just be overloaded.

You may be able to use the "net tdb locking" command to find out more about the key in question. You'll need to run the command while clients are accessing the file represented by the key. If the file is being accessed constantly and heavily then that shouldn't be a problem. ;-)

If the contention is for the root directory of a share, and you don't actually need lock coherency there, then you could think about using the

        fileid:algorithm = fsname_norootdir

option. However, I note you're using "fileid:algorithm = fsid". If that is needed for CephFS then the fsname_norootdir option might not be appropriate.

You could also consider using the fileid:nolockinode hack if it is appropriate.

You should definitely read vfs_fileid(8) before using either of these options.

Although clustering has obvious benefits, it doesn't come for free. Dealing with contention can be tricky... :-)

peace & happiness,
martin
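P.S. In case it helps, here is roughly what those two knobs look like in smb.conf. This is only a sketch: the nolockinode value is a placeholder inode number, and you should confirm the exact semantics against vfs_fileid(8) for your version before using either option:

        vfs objects = fileid
        fileid:algorithm = fsname_norootdir
        ; or, to disable lock coherency for one specific inode (123456 is a placeholder):
        ; fileid:nolockinode = 123456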
Hi Martin,

Many thanks for the detailed response. A few follow-ups inline:

On Wed, Sep 19, 2018 at 5:19 AM Martin Schwenke <martin at meltin.net> wrote:

> Is it always the same key? A small group of keys? That is likely
> to mean contention. If migrations for many different keys are taking
> longer than the threshold then ctdbd might just be overloaded.

Confirmed, it's always the same key, which I suppose is good news?

> You may be able to use the "net tdb locking" command to find out more
> about the key in question. You'll need to run the command while
> clients are accessing the file represented by the key.

Currently reporting: "Record with key DE0726567AF1EAFD4A741403000100000000000000000000 not found."

So I guess clients aren't currently accessing it. The messages are fairly frequent, though, so I should be able to catch it; I may just run that command on a loop until it does.
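Something like this, I think. Just a rough sketch: I'm matching on the "not found" text in the output rather than relying on the exit status, since I'm not sure what the command returns when the record is missing:

        #!/bin/sh
        # Poll "net tdb locking" until the contended record appears in locking.tdb
        KEY=DE0726567AF1EAFD4A741403000100000000000000000000
        while :; do
                out=$(net tdb locking "$KEY" 2>&1)
                case "$out" in
                *"not found"*) sleep 1 ;;               # record not present yet, keep polling
                *) printf '%s\n' "$out"; break ;;       # caught it: print the lock info and stop
                esac
        done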
Is there any other way of translating the key to the inode?

> If the contention is for the root directory of a share, and you don't
> actually need lock coherency there, then you could think about using the
>
>         fileid:algorithm = fsname_norootdir
>
> option. However, I note you're using "fileid:algorithm = fsid". If
> that is needed for CephFS then the fsname_norootdir option might not be
> appropriate.

This was a leftover from a short-lived experiment with OCFS2, where I think it was required. I think CephFS should be fine with fsname.

> You could also consider using the fileid:nolockinode hack if it is
> appropriate.
>
> You should definitely read vfs_fileid(8) before using either of these
> options.

I'll have a read. Thanks again for your assistance.