Howard, Stewart Jameson
2015-Oct-16 14:44 UTC
[Samba] Problems with TDBs on CTDB-managed Samba instance
Hi All,

My site has two separate clustered Samba instances (managed by two independent CTDB instances) running over GPFS. In the last couple of weeks, we have seen a recurring issue that causes the `smbd` process in *one* of these instances to become unresponsive (as seen by CTDB), which results in flapping of CTDB and multiple IP takeover runs.

The symptoms that we observe are:

1) Samba becomes unresponsive

2) The output of `smbstatus` starts to show "-1" for each connection where it should be showing user/group information

3) Samba starts terminating connected sessions and CTDB kills its IP address

4) After some thrashing (Samba restarts, presumably), CTDB is able to recover and start serving again

We have noticed that the following messages have started appearing in syslog, as well as in the winbind log, on the afflicted cluster:

"""
[2015/10/16 10:25:30.892468, 0] ../source3/lib/util_tdb.c:313(tdb_log)
  tdb(<PATH OMITTED>/gencache_notrans.tdb): tdb_rec_read bad magic 0xd9fee666 at offset=517632

[2015/10/16 10:25:37.827964, 0] ../source3/lib/util_tdb.c:313(tdb_log)
  tdb(<PATH OMITTED>/gencache_notrans.tdb): tdb_expand overflow detected current map_size[4294967295] size[124]!
"""

These messages appear in *great* number, especially the one about "tdb_expand overflow detected." Interestingly, the file in question is exactly as large as the map_size value reported in the error: 4294967295 bytes (2^32 - 1):

"""
[root@<HOST> lock]# ll gencache_notrans.tdb
-rw-r--r-- 1 root root 4294967295 Oct 16 10:39 gencache_notrans.tdb
"""

On the Samba cluster that is problem-free, this file is a mere ~500K:

"""
[root@rsgwb2 lock]# ll gencache_notrans.tdb
-rw-r--r-- 1 root root 528384 Oct 16 10:40 gencache_notrans.tdb
"""

Although we poorly understand the cause of the current issue, our suspicion is that it relates somehow to the enormous size of gencache_notrans.tdb. Can anybody comment on what this file is for? Looking at https://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/install.html#tdbdocs I see no description of this file, only of gencache.tdb.

Also, if anyone has experience with this type of issue or insight into it, your help is greatly appreciated :)

Thank you so much for your time!

Stewart Howard
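For reference, a suspect TDB like this can be inspected offline with the tdb utilities that ship alongside Samba. A minimal sketch follows; the lock-directory path is a placeholder, and working on a copy avoids touching the live file:

"""
# Sketch only: offline inspection of a suspect TDB with the standard tdb tools.
# Substitute the lock directory configured on your cluster for the placeholder path.
cp /path/to/lock/gencache_notrans.tdb /tmp/gencache_notrans.copy.tdb

ls -l /tmp/gencache_notrans.copy.tdb           # has the file ballooned toward 4 GB?
tdbtool /tmp/gencache_notrans.copy.tdb check   # integrity check on the copy
tdbdump /tmp/gencache_notrans.copy.tdb | head  # peek at the first few records
"""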
Jeremy Allison
2015-Oct-16 23:53 UTC
[Samba] Problems with TDBs on CTDB-managed Samba instance
On Fri, Oct 16, 2015 at 02:44:36PM +0000, Howard, Stewart Jameson wrote:

> [...]
>
> We have noticed that the following messages have started appearing in syslog, as well as in the winbind log, on the afflicted cluster:
>
> """
> [2015/10/16 10:25:30.892468, 0] ../source3/lib/util_tdb.c:313(tdb_log)
>   tdb(<PATH OMITTED>/gencache_notrans.tdb): tdb_rec_read bad magic 0xd9fee666 at offset=517632
>
> [2015/10/16 10:25:37.827964, 0] ../source3/lib/util_tdb.c:313(tdb_log)
>   tdb(<PATH OMITTED>/gencache_notrans.tdb): tdb_expand overflow detected current map_size[4294967295] size[124]!
> """

tdb_rec_read bad magic - this means a corrupted tdb database.

Can you shut down, remove these tdbs, and restart?
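On a CTDB-managed cluster, that procedure might look roughly like the sketch below. The service invocation and the lock-directory path are assumptions about this particular setup, and the file is moved aside rather than deleted so it stays available for post-mortem analysis:

"""
# Rough sketch only -- service name and lock directory are assumptions
# about this environment, not details confirmed in the thread.
onnode all service ctdb stop     # stop CTDB (and the smbd/winbindd it manages) on every node

# On each node (or once, if the lock directory lives on shared storage):
mkdir -p /path/to/lock/corrupt-tdbs
mv /path/to/lock/gencache_notrans.tdb /path/to/lock/corrupt-tdbs/   # the file named in the errors

onnode all service ctdb start    # bring the cluster back; the cache is recreated on demand
"""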
Howard, Stewart Jameson
2015-Oct-17 16:13 UTC
[Samba] Problems with TDBs on CTDB-managed Samba instance
Hi Jeremy,

Thanks so much for your reply! As a matter of fact, we did just that around 3:45p yesterday, when our CTDB cluster was unable to self-heal from the latest in this series of failover events. Here's how the situation went down:

1) We saw flapping identical to that described in my original post

2) After about 30 minutes of waiting, CTDB was just spinning, with `smbd` repeatedly failing its health checks

3) We stopped CTDB on the cluster with `onnode all service ctdb stop`

4) We moved the following files out of the way: gencache_notrans.tdb, gencache.tdb, mutex.tdb

5) We started CTDB again with `onnode all service ctdb start`

At that point, the CTDB cluster came back up successfully, and all has been quiet since yesterday afternoon. We notice that gencache_notrans.tdb has not even begun to move toward its (supposedly) pathological high-water-mark size of ~4G. In fact, we have a script watching its size right now, and it has been steady at ~500K since the restart yesterday:

"""
[root@<HOST> lock]# ll gencache_notrans.tdb
-rw-r--r-- 1 root root 532480 Oct 17 12:01 gencache_notrans.tdb
"""

Because the file is now holding steady, in contrast to its earlier repeated, intermittent, and rapid inflation, we suspect that there is some operational condition which *causes* the corruption and which we're running into with some regularity. Our cluster is attached to a rather large ADS domain, and `strings gencache_notrans.tdb | less` during the trouble reveals a long series of Windows SID entries followed eventually by a *very large* number of the ASCII character "B" (presumably running all the way to the end of the file). Our current suspicion is that there is some ADS user whose record, when ingested, somehow corrupts the TDB. Our investigation into the last *successfully-ingested* SID in the corrupt TDB will continue on Monday morning.

Of course, we may be wrong in this hypothesis. Can you (or anyone else on the list) comment on the apparent correctness of this assessment and possibly shed light on what scenarios are known to result in TDB corruption? The actions we took yesterday seem to have alleviated the problem for now; however, we are very interested in taking any steps necessary to prevent its recurrence.

Again, thank you so much for your time :)

Stewart Howard
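For illustration, a size watcher of the kind mentioned above can be as simple as the following sketch; the path, polling interval, and threshold are placeholders rather than details of the actual script in use:

"""
# Illustration only: a minimal watcher for the size of gencache_notrans.tdb.
# Path, polling interval, and alert threshold are placeholders.
TDB=/path/to/lock/gencache_notrans.tdb
THRESHOLD=$((1024 * 1024 * 1024))   # warn once the file passes 1 GiB

while true; do
    SIZE=$(stat -c %s "$TDB" 2>/dev/null || echo 0)
    echo "$(date '+%F %T') ${SIZE} bytes"
    if [ "$SIZE" -gt "$THRESHOLD" ]; then
        echo "WARNING: ${TDB} has grown past ${THRESHOLD} bytes" >&2
    fi
    sleep 60
done
"""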