Very helpful. Thank you, Martin. I'd like to share the information below
with you and solicit your fine feedback :-) I provide additional detail in
case there is something else you feel strongly we should consider.

We made some changes last night; let me share those with you. The error
that kept repeating and causing these failures was:

  Takeover run starting
  RELEASE_IP 10.200.1.230 failed on node 0, ret=-1
  Assigning banning credits to node 0
  takeover run failed, ret=-1
  ctdb_takeover_run() failed
  Takeover run unsuccessful
  Node 0 reached 4 banning credits - banning it for 300 seconds
  Banning node 0 for 300 seconds
  Unassigned IP 10.206.2.124 can be served by this node
  Unassigned IP 10.200.1.230 can be served by this node
  IP 10.206.2.124 incorrectly on an interface

Last night we truncated the public_addresses file, and then everything
started working. Since then we have been rereading the documentation on
the public addresses file. It may be that we gravely misunderstood the
*public_addresses* file; we never read that part of the documentation
carefully. The *nodes* file made perfect sense. The point we missed is
that CTDB takes floating (unreserved/unused) addresses from
*public_addresses* and assigns them as aliases to a SECOND, public
interface. We did not plan a private subnet for the node traffic and a
separate public subnet for the client traffic. More on the changes we
made last night in a second...

Let me explain our architecture, for context. I would love some feedback,
expressed concerns, etc. We have built a geo-distributed SMB file system.
It is deployed in AWS across four regions globally, internally uses
ObjectiveFS as the backend shared file system and cache, and uses a custom
etcd locking helper (written in Go). The instances have only one network
interface, which is private; they do not have a second interface (possibly
our mistake). The existing private interface is AWS-assigned, static, and
cannot be reassigned (obviously). Initial testing is promising. Leader
election is not instantaneous, as you'd expect; it takes upwards of 5
seconds, because etcd is operating as a geo-distributed, fully meshed
cluster and the current leader could be a continent away. Not bad, though.

Here is our mistake... The initial *public_addresses* file had the same
addresses as the *nodes* file, namely the private IP addresses assigned by
AWS. Not good, right? The error messages shown above were the result.
However, once we truncated the file,

  echo '' > /etc/ctdb/public_addresses
  ctdb status

the CTDB status showed all nodes as healthy:

  Number of nodes:2
  pnn:0 10.200.1.230     OK
  pnn:1 10.206.2.124     OK (THIS NODE)
  Generation:1547616286
  Size:2
  hash:0 lmaster:0
  hash:1 lmaster:1
  Recovery mode:NORMAL (0)
  Recovery master:0

And after these changes the logs simply have these messages periodically:

  Disabling takeover runs for 60 seconds
  Reenabling takeover runs

*Is this normal?*

(This is a modest test rig, mind you, with only one Samba process per
region. In prod it will be several regions, multiple processes, etc.)
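For our own notes, this is the kind of split we now understand the
documentation to intend. The addresses and interface name below are only
placeholders for illustration, not anything from our real setup:

  /etc/ctdb/nodes  (the fixed, private address of each node, used only for
  CTDB's internal node-to-node traffic):

    192.168.10.1
    192.168.10.2

  /etc/ctdb/public_addresses  (spare, floating addresses that CTDB adds as
  aliases on a client-facing interface and moves between nodes on failover,
  each with a netmask and an interface):

    10.0.1.50/24 eth1
    10.0.1.51/24 eth1

With a single AWS-assigned interface per instance, and nodes in different
regions, we do not currently have a subnet where floating addresses like
that could live, which is presumably why the empty public_addresses file is
the right interim answer for us.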
Really appreciate your help, Martin. Thank you!

On Wed, Aug 5, 2020 at 6:53 PM Martin Schwenke <martin at meltin.net> wrote:

> Hi Bob,
>
> On Wed, 5 Aug 2020 17:10:11 -0400, Robert Buck via samba
> <samba at lists.samba.org> wrote:
>
> > Could I impose upon someone to provide some guidance? Some hint? Thank you
>
> Any time! :-)
>
> > Is a shared file system actually required? If etcd is used to manage the
> > global recovery lock, is there any need at that point for a shared file
> > system?
> >
> > In other words, are there samba or CTDB files (state) that must be on a
> > shared file system, or can each clustered host simply have these files
> > locally?
> >
> > What must be shared? What can be optionally shared?
>
> The only thing that CTDB uses the shared filesystem for is the recovery
> lock, so if you're using etcd for the recovery lock then CTDB will not
> be using the shared filesystem.
>
> Clustered Samba (smbd in this case) expects to serve files to clients
> from a shared filesystem. Although some of the metadata is stored in
> CTDB, smbd makes some assumptions about the underlying filesystem
> (e.g. I/O coherence is required when using POSIX locking).
>
> > The doc is not clear on this.
>
> I have updated the wiki to mention this:
>
>   https://wiki.samba.org/index.php/Setting_up_a_cluster_filesystem#Checking_lock_coherence
>
> The page about ping_pong was already there but it doesn't look like
> there was a link to it.
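(Interjecting here for our own follow-up: if I am reading that wiki page
correctly, the lock-coherence check for our 2-node rig would be to run the
ping_pong utility on both nodes at the same time, against the same file on
the ObjectiveFS mount, passing the node count plus one as the lock count.
The mount path below is just a placeholder:)

  # run concurrently on node 0 and node 1, against the shared filesystem
  ping_pong /srv/objectivefs/share/ping_pong.dat 3

  # then again with -rw to also exercise read/write coherence
  ping_pong -rw /srv/objectivefs/share/ping_pong.dat 3

(If locking is coherent, the locks/sec rate reported on the first node
should drop noticeably once the second instance starts, and both nodes
should settle at roughly the same rate.)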
> I also need to update the ctdb(7) manual page to point to the wiki.
>
> > In our scenario, when we attempt to start up a second node, it always goes
> > into a banned state. If we shut down the healthy node and restart CTDB on
> > the "failed node" it now works. We're trying to understand this.
>
> One reason I can think of for this is the recovery lock check during
> recovery. When recovery completes and CTDB is setting the recovery
> mode back to "normal" on each node, it does a sanity check where it
> attempts to take the recovery lock. It should never be able to do this
> because the lock should already be held by another process on the
> master/leader node.
>
> I've documented a couple of reasons, unrelated to the recovery lock,
> why CTDB can behave badly:
>
>   https://wiki.samba.org/index.php/Basic_CTDB_configuration#Troubleshooting
>
> So, 2 questions:
>
> * Does the 2nd node still get banned if you disable the recovery lock?
>   If not then the problem is clearly with the recovery lock.
>
> * What do the logs say about the reason for banning the node?
>
> peace & happiness,
> martin

--
BOB BUCK
SENIOR PLATFORM SOFTWARE ENGINEER
SKIDMORE, OWINGS & MERRILL
7 WORLD TRADE CENTER
250 GREENWICH STREET
NEW YORK, NY 10007
T (212) 298-9624
ROBERT.BUCK at SOM.COM


Hi Bob,

On Thu, 6 Aug 2020 06:55:31 -0400, Robert Buck <robert.buck at som.com> wrote:

> And so we've been rereading the doc on the public addresses file. So it may
> be we have gravely misunderstood the *public_addresses* file, we never read
> that part of the documentation carefully. The *nodes* file made perfect
> sense, and the point we missed is that CTDB is using floating
> (unreserved/unused) addresses and assigning them to a SECOND public
> interface (aliases). We did not plan a private subnet for the node traffic,
> and a separate public subnet for the client traffic.
>
> [...]
>
> Here is our mistake... The initial *public_addresses* file had identical
> addresses as the *nodes* file, containing the private IP addresses assigned
> by AWS. Not good, right? The error messages shown, above, were the result.

Yep, that would definitely cause chaos. ;-)

CTDB is really designed to have the node traffic go over a private
network. There is no authentication between nodes (other than checking
that a connecting node is listed in the nodes file) and there is no
encryption between nodes. Contents of files will not be transferred
between nodes by CTDB, but if filenames are sensitive then they could be
exposed if they are not on a private network.

In the future we plan to have some authentication between nodes when
they connect, most likely a shared secret used to generate something
from the nodes file.

> [...]
>
> And after these changes the logs simply have these messages periodically:
>
>   Disabling takeover runs for 60 seconds
>   Reenabling takeover runs
>
> *Is this normal?*

How frequently are these messages logged? They should occur as nodes
join but should stop after that. If they continue, are there any clues
indicating why the takeover runs occur? A takeover run is just what CTDB
currently calls a recalculation of the floating IP addresses for
fail-over.

peace & happiness,
martin


On Sat, Aug 8, 2020 at 2:52 AM Martin Schwenke <martin at meltin.net> wrote:

> Hi Bob,
>
> On Thu, 6 Aug 2020 06:55:31 -0400, Robert Buck <robert.buck at som.com>
> wrote:
>
> > And so we've been rereading the doc on the public addresses file. So it may
> > be we have gravely misunderstood the *public_addresses* file, we never read
> > that part of the documentation carefully. The *nodes* file made perfect
> > sense, and the point we missed is that CTDB is using floating
> > (unreserved/unused) addresses and assigning them to a SECOND public
> > interface (aliases). We did not plan a private subnet for the node traffic,
> > and a separate public subnet for the client traffic.
> >
> > [...]
> >
> > Here is our mistake... The initial *public_addresses* file had identical
> > addresses as the *nodes* file, containing the private IP addresses assigned
> > by AWS. Not good, right? The error messages shown, above, were the result.
>
> Yep, that would definitely cause chaos. ;-)
>
> CTDB is really designed to have the node traffic go over a private
> network. There is no authentication between nodes (other than checking
> that a connecting node is listed in the nodes file) and there is no
> encryption between nodes. Contents of files will not be transferred
> between nodes by CTDB, but if filenames are sensitive then they could be
> exposed if they are not on a private network.
>
> In the future we plan to have some authentication between nodes when
> they connect, most likely a shared secret used to generate something
> from the nodes file.
>
> > [...]
> >
> > And after these changes the logs simply have these messages periodically:
> >
> >   Disabling takeover runs for 60 seconds
> >   Reenabling takeover runs
> >
> > *Is this normal?*
>
> How frequently are these messages logged? They should occur as nodes
> join but should stop after that. If they continue, are there any clues
> indicating why the takeover runs occur? A takeover run is just what CTDB
> currently calls a recalculation of the floating IP addresses for
> fail-over.

Hi Martin, thank you for your helpful feedback, this is great.

Yes, those log messages were occurring once per second (precisely).
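(A quick way to confirm that rate, assuming the messages are going to
syslog/journald as they do on our boxes, is simply to count them over a
window, for example:)

  # count takeover-run messages from the last hour on this node
  journalctl -t ctdbd -t ctdb-recoverd --since "1 hour ago" \
    | grep -c 'Disabling takeover runs'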
Then, after several hours, they stopped after these messages in the log:

  ctdbd[1220]: 10.206.2.124:4379: node 10.200.1.230:4379 is dead: 0 connected
  ctdbd[1220]: Tearing down connection to dead node :0
  ctdb-recoverd[1236]: Current recmaster node 0 does not have CAP_RECMASTER, but we (node 1) have - force an election
  ctdbd[1220]: Recovery mode set to ACTIVE
  ctdbd[1220]: This node (1) is now the recovery master
  ctdb-recoverd[1236]: Election period ended
  ctdb-recoverd[1236]: Node:1 was in recovery mode. Start recovery process
  ctdb-recoverd[1236]: ../../ctdb/server/ctdb_recoverd.c:1347 Starting do_recovery
  ctdb-recoverd[1236]: Attempting to take recovery lock (!/usr/local/bin/lockctl elect --endpoints REDACTED:2379 SM
  ctdbd[1220]: High RECLOCK latency 4.268180s for operation recd reclock
  ctdb-recoverd[1236]: Recovery lock taken successfully
  ctdb-recoverd[1236]: ../../ctdb/server/ctdb_recoverd.c:1422 Recovery initiated due to problem with node 0
  ctdb-recoverd[1236]: ../../ctdb/server/ctdb_recoverd.c:1447 Recovery - created remote databases
  ctdb-recoverd[1236]: ../../ctdb/server/ctdb_recoverd.c:1476 Recovery - updated flags
  ctdb-recoverd[1236]: Set recovery_helper to "/usr/libexec/ctdb/ctdb_recovery_helper"
  ...
  recover database 0x2ca251cf
  ...
  Thaw db: smbXsrv_client_global.tdb generation 999520140
  Release freeze handle for db smbXsrv_client_global.tdb
  19 of 19 databases recovered
  Recovery mode set to NORMAL
  ...
  No nodes available to host public IPs yet
  ...
  Reenabling recoveries after timeout
  ...
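(The one line there that gives us pause is the "High RECLOCK latency
4.268180s" warning. With the recovery lock arbitrated by a geo-distributed
etcd cluster, some latency is probably unavoidable. My understanding is
that the warning threshold is the RecLockLatencyMs tunable, which I believe
defaults to 1000 ms, so if the warning turns out to be just noise for our
topology we could raise it; a sketch, assuming a reasonably recent CTDB:)

  # runtime change on a running node
  ctdb setvar RecLockLatencyMs 5000

  # persistent across restarts: add this line to /etc/ctdb/ctdb.tunables
  RecLockLatencyMs=5000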
Then it's a clean syslog after that. Thank you!

> peace & happiness,
> martin

--
BOB BUCK
SENIOR PLATFORM SOFTWARE ENGINEER
SKIDMORE, OWINGS & MERRILL
7 WORLD TRADE CENTER
250 GREENWICH STREET
NEW YORK, NY 10007
T (212) 298-9624
ROBERT.BUCK at SOM.COM