thr3ads.net - samba - [Samba] Problems with TDBs on CTDB-managed Samba instance [Oct 2015]

Howard, Stewart Jameson

2015-Oct-17 16:13 UTC

[Samba] Problems with TDBs on CTDB-managed Samba instance

Hi Jeremy,

Thanks so much for your reply!  As a matter of fact, we did just that around
3:45p yesterday when our CTDB cluster was unable to self-heal from the latest in
this series of failover events.  Here's how the situation went down:

1)  We saw flapping identical to that described in my original post

2)  After about 30 minutes of waiting, CTDB was just spinning with `smbd`
repeatedly failing its health checks.

3)  We stopped CTDB on the cluster with `onnode all service ctdb stop`

4)  We moved the following files out of the way:

gencache_notrans.tdb
gencache.tdb
mutex.tdb

5)  We started CTDB again with `onnode all service ctdb start`
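
Steps 3 to 5 above can be sketched as a small script. This is only a rough sketch, not necessarily the exact commands that were run: the `move_tdbs_aside` helper, the lock directory path and the backup directory name are assumptions for illustration. Only the `onnode all service ctdb stop` / `start` commands and the three file names come from the steps above.

```shell
# Sketch: move the local cache TDBs out of the way while CTDB/Samba is stopped.
# The helper name and both directories are hypothetical; adjust to your setup.

move_tdbs_aside() {
    local lock_dir=$1 backup_dir=$2 f
    mkdir -p "$backup_dir" || return 1
    for f in gencache_notrans.tdb gencache.tdb mutex.tdb; do
        # Skip files that do not exist on this node.
        if [ -e "$lock_dir/$f" ]; then
            mv -- "$lock_dir/$f" "$backup_dir/" || return 1
        fi
    done
}

# Intended sequence (commented out so nothing runs by accident):
#   onnode all service ctdb stop
#   move_tdbs_aside /path/to/lock /path/to/lock/aside    # on each node
#   onnode all service ctdb start
```

Moving the files rather than deleting them keeps the corrupted copies available for later analysis.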

At that point, the CTDB cluster came back up successfully and all has been quiet
since yesterday afternoon.  We notice that gencache_notrans.tdb has not even
begun to move toward its (supposedly) pathological high-water-mark size of ~4G. 
In fact, we have a script watching its size right now and it is steady at ~500K
since the restart yesterday:

"""
[root@<HOST> lock]# ll gencache_notrans.tdb
-rw-r--r-- 1 root root 532480 Oct 17 12:01 gencache_notrans.tdb
""" 

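A size watch of this kind could look roughly like the sketch below. It is not the script actually in use here: the helper names `file_size` and `exceeds_limit`, the 8 MiB threshold, the path and the polling interval are all assumptions.

```shell
# Sketch: report when a file grows beyond a byte limit.
# `file_size` and `exceeds_limit` are hypothetical helper names.

file_size() {
    # Reading from stdin makes wc print only the byte count;
    # tr strips the padding that some wc implementations add.
    wc -c < "$1" | tr -d ' '
}

exceeds_limit() {
    local size
    [ -f "$1" ] || return 2
    size=$(file_size "$1")
    [ "$size" -gt "$2" ]
}

# Example polling loop (commented out; path, limit and interval are placeholders):
#   while sleep 60; do
#       exceeds_limit /path/to/lock/gencache_notrans.tdb $((8 * 1024 * 1024)) &&
#           logger "gencache_notrans.tdb is larger than 8 MiB"
#   done
```
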
Because of the current steady size of this file compared to its repeated,
intermittent, and rapid inflation, we suspect that there is some operational
condition which *causes* the corruption and which we're running into with
some regularity.  Our cluster is attached to a rather large ADS domain and
`strings gencache_notrans.tdb|less` during the trouble reveals a long series of
Windows SID entries followed eventually by a *very large* number of the ASCII
character "B" (presumably going all the way to the end of the file).
Our current suspicion is that there is some ADS user whose record, when
ingested, somehow corrupts the TDB.  Our investigations into the last
*successfully-ingested* SID in the corrupt TDB will continue on Monday morning.
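
The two observations above (a long run of "B" bytes and a last readable SID) can be put into numbers with something like the following. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the SID pattern is a simplification, and `count_b_bytes` counts every 0x42 byte in the file, not only the trailing run.

```shell
# Sketch: quantify the "B" padding and find the last SID-like string in a
# TDB file. Both helper names are hypothetical.

count_b_bytes() {
    # Total bytes minus the bytes left after deleting every "B" (0x42).
    local total rest
    total=$(wc -c < "$1" | tr -d ' ')
    rest=$(LC_ALL=C tr -d 'B' < "$1" | wc -c | tr -d ' ')
    echo $(( total - rest ))
}

last_sid() {
    # -a treats the binary file as text, -o prints only the matching part.
    LC_ALL=C grep -aEo 'S-1-[0-9]+(-[0-9]+)+' "$1" | tail -n 1
}
```

On a ~4G file with few newlines, `grep` may need a lot of memory; running the pattern over `strings` output, as in the command quoted above, is an alternative.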

Of course, we may be wrong in this hypothesis.  Can you (or anyone else on the
list) comment on the apparent correctness of this assessment and possibly shed
light on what scenarios are known to result in TDB corruption?  The actions we
took yesterday seem to have alleviated the problem for now; however, we are very
interested in taking any steps necessary to prevent its recurrence in the
future.

Again, thank you so much for your time  :)

Stewart Howard
________________________________________
From: Jeremy Allison <jra at samba.org>
Sent: Friday, October 16, 2015 7:53 PM
To: Howard, Stewart Jameson
Cc: samba at lists.samba.org
Subject: Re: [Samba] Problems with TDBs on CTDB-managed Samba instance

On Fri, Oct 16, 2015 at 02:44:36PM +0000, Howard, Stewart Jameson wrote:
> Hi All,
>
> My site has two separate clustered Samba instances (managed by two
> independent CTDB instances) running over GPFS.  In the last couple of weeks,
> we have seen a recurring issue that causes the `smbd` process in *one* of
> these instances to become unresponsive (as seen by CTDB), which results in
> flapping of CTDB and multiple IP takeover runs.
>
> The symptoms that we observe are:
>
> 1)  Samba becomes unresponsive
>
> 2)  The output of `smbstatus` starts to show "-1" for each connection where
> it should be showing user/group information.
>
> 3)  Samba starts terminating connected sessions and CTDB kills its IP
> address
>
> 4)  After some thrashing (Samba restarts, presumably), CTDB is able to
> recover and start serving again
>
> We have noticed that the following messages have started appearing in
> syslog, as well as in the winbind log on the afflicted cluster:
>
> """
> [2015/10/16 10:25:30.892468,  0] ../source3/lib/util_tdb.c:313(tdb_log)
>   tdb(<PATH OMITTED>/gencache_notrans.tdb): tdb_rec_read bad magic 0xd9fee666 at offset=517632
>
> [2015/10/16 10:25:37.827964,  0] ../source3/lib/util_tdb.c:313(tdb_log)
>   tdb(<PATH OMITTED>/gencache_notrans.tdb): tdb_expand overflow detected current map_size[4294967295] size[124]!

tdb_rec_read bad magic - this means a corrupted tdb
database.

Can you shut down, remove these tdbs and restart?
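
The `tdb_expand overflow` line quoted above reports `map_size[4294967295]`, which is 2^32 - 1, the largest value an unsigned 32-bit integer can hold. Assuming tdb tracks file offsets as unsigned 32-bit values (an assumption worth checking against the tdb source), any further expansion, even by the 124 bytes in the log line, cannot be represented. The arithmetic can be sketched as follows; `expand_would_overflow` is a hypothetical name, not a tdb function.

```shell
# Sketch: would adding `add` bytes to `map_size` exceed an unsigned 32-bit
# offset? Relies on the shell doing 64-bit arithmetic (true for bash on Linux).
UINT32_MAX=4294967295

expand_would_overflow() {
    local map_size=$1 add=$2
    [ $(( map_size + add )) -gt "$UINT32_MAX" ]
}

# Values taken from the log line: map_size[4294967295] size[124]
if expand_would_overflow 4294967295 124; then
    echo "expansion would overflow a 32-bit offset"
fi
```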

Jeremy Allison

2015-Oct-18 00:31 UTC

On Sat, Oct 17, 2015 at 04:13:30PM +0000, Howard, Stewart Jameson wrote:
> Hi Jeremy,
>
> Thanks so much for your reply!  As a matter of fact, we did just that
> around 3:45p yesterday when our CTDB cluster was unable to self-heal from
> the latest in this series of failover events.  Here's how the situation
> went down:
>
> 1)  We saw flapping identical to that described in my original post
>
> 2)  After about 30 minutes of waiting, CTDB was just spinning with `smbd`
> repeatedly failing its health checks.
>
> 3)  We stopped CTDB on the cluster with `onnode all service ctdb stop`
>
> 4)  We moved the following files out of the way:
>
> gencache_notrans.tdb
> gencache.tdb
> mutex.tdb
>
> 5)  We started CTDB again with `onnode all service ctdb start`
>
> At that point, the CTDB cluster came back up successfully and all has been
> quiet since yesterday afternoon.  We notice that gencache_notrans.tdb has
> not even begun to move toward its (supposedly) pathological high-water-mark
> size of ~4G.  In fact, we have a script watching its size right now and it
> is steady at ~500K since the restart yesterday:
>
> """
> [root@<HOST> lock]# ll gencache_notrans.tdb
> -rw-r--r-- 1 root root 532480 Oct 17 12:01 gencache_notrans.tdb
> """
>
> Because of the current steady size of this file compared to its repeated,
> intermittent, and rapid inflation, we suspect that there is some operational
> condition which *causes* the corruption and which we're running into with
> some regularity.  Our cluster is attached to a rather large ADS domain and
> `strings gencache_notrans.tdb|less` during the trouble reveals a long series
> of Windows SID entries followed eventually by a *very large* number of the
> ASCII character "B" (presumably going all the way to the end of the file).
> Our current suspicion is that there is some ADS user whose record, when
> ingested, somehow corrupts the TDB.  Our investigations into the last
> *successfully-ingested* SID in the corrupt TDB will continue on Monday
> morning.
>
> Of course, we may be wrong in this hypothesis.  Can you (or anyone else on
> the list) comment on the apparent correctness of this assessment and
> possibly shed light on what scenarios are known to result in TDB corruption?
> The actions we took yesterday seem to have alleviated the problem for now;
> however, we are very interested in taking any steps necessary to prevent
> its recurrence in the future.
>
> Again, thank you so much for your time  :)

So you have a copy of the corrupted tdb files?

I don't know of any outstanding bugs that can cause
tdb corruption, I'm afraid.

Volker Lendecke

2015-Oct-18 08:24 UTC

On Sat, Oct 17, 2015 at 04:13:30PM +0000, Howard, Stewart Jameson wrote:
> gencache_notrans.tdb
> gencache.tdb
> mutex.tdb

Just a side-remark: These tdbs have nothing to do with ctdb,
they are purely local.

> Because of the current steady size of this file compared
> to its repeated, intermittent, and rapid inflation, we
> suspect that there is some operational condition which
> *causes* the corruption and which we're running into with
> some regularity.  Our cluster is attached to a rather
> large ADS domain and `strings gencache_notrans.tdb|less`
> during the trouble reveals a long series of Windows SID
> entries followed eventually by a *very large* number of
> the ASCII character "B" (presumably going all the way to
> the end of the file).  Our current suspicion is that there
> is some ADS user whose record, when ingested, somehow
> corrupts the TDB.  Our investigations into the last
> *successfully-ingested* SID in the corrupt TDB will
> continue on Monday morning.

We should never go beyond a few MBs for
gencache_notrans.tdb.

What version of Samba are you running, in particular what
version of tdb? There have been significant improvements in
tdb's freelist handling that should keep tdbs a lot smaller.
These changes came with Samba 4.2.

Volker

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt at sernet.de

Howard, Stewart Jameson

2015-Oct-20 16:07 UTC

Hi Volker and Jeremy,

Thanks for looking at my thread  :)

It looks like we're running version 1.2.10-1 of libtdb:

"""
[root@<HOST> bin]# rpm -qa|grep -i libtdb
libtdb-1.2.10-1.el6.x86_64
"""

As for Samba, we're running the Sernet distribution at version 4.1.6-7:

"""
[root@<HOST> bin]# rpm -qa|grep -i samba
sernet-samba-libs-4.1.6-7.el6.x86_64
sernet-samba-debuginfo-4.1.6-7.el6.x86_64
sernet-samba-ad-4.1.6-7.el6.x86_64
sernet-samba-client-4.1.6-7.el6.x86_64
sernet-samba-libsmbclient-devel-4.1.6-7.el6.x86_64
sernet-samba-common-4.1.6-7.el6.x86_64
sernet-samba-4.1.6-7.el6.x86_64
sernet-samba-libwbclient-devel-4.1.6-7.el6.x86_64
sernet-samba-winbind-4.1.6-7.el6.x86_64
sernet-samba-libsmbclient0-4.1.6-7.el6.x86_64
"""

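To check the reported Samba version against the 4.2 release Volker mentioned, a version-aware comparison avoids the pitfalls of comparing dotted strings as plain text. The sketch below assumes GNU `sort` with `-V` is available; `version_lt` is a hypothetical helper, and `4.2.0` is used here as a stand-in for "Samba 4.2".

```shell
# Sketch: compare dotted version strings with GNU sort -V.
# `version_lt` is a hypothetical helper name.

version_lt() {
    # True when $1 sorts strictly before $2 in version order.
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

if version_lt 4.1.6 4.2.0; then
    echo "4.1.6 predates 4.2"
fi
```
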
I think Jeremy asked if I have a copy of the old, corrupted
gencache_notrans.tdb.  I do have the file, but I will have to check on the
possibility of posting it, since it contains domain SIDs internal to our
organization.  Also, the size of the file is ~4G, which is over the limit that
our mail servers will handle.  If it turns out that I'm able to provide the
file, we might have to find some alternative way for me to post it.  In the
meantime, is there any analysis of this file that you guys (or anyone else) can
suggest to hunt for clues as to the cause?
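
If the internal SIDs are the obstacle to sharing, one option might be to share a redacted text listing instead of the binary file. The following is only a sketch: `redact_sids` is a hypothetical helper, it works on text such as `strings` output rather than on the TDB itself, and it leaves the trailing RID visible, which may or may not be acceptable.

```shell
# Sketch: mask the domain-specific part of SIDs in text read from stdin.
# `redact_sids` is a hypothetical helper name.

redact_sids() {
    # S-1-5-21-<a>-<b>-<c>[-<rid>] becomes S-1-5-21-X-X-X[-<rid>]
    sed -E 's/S-1-5-21-[0-9]+-[0-9]+-[0-9]+/S-1-5-21-X-X-X/g'
}

# Example (commented out): list distinct redacted SIDs found in the file
#   strings gencache_notrans.tdb | grep -Eo 'S-1-5-21(-[0-9]+)+' | redact_sids | sort -u
```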

Thank you so much for all of your help!!

Stewart
