Hi Bob,
On Tue, 27 Oct 2020 15:09:34 +1100, Martin Schwenke via samba
<samba at lists.samba.org> wrote:
> On Sun, 25 Oct 2020 20:44:07 -0400, Robert Buck <robert.buck at
som.com>
> wrote:
> 
> > We use a Golang-based lock tool that we wrote for CTDB. That tool
interacts
> > with our 3.4 etcd cluster, and follows the requirements specified in
the
> > project.
> > 
> > Question, does the external command line tool get called when LMASTER
and
> > RECMASTER are false? Given a scenario where we have a set of processes
that
> > have it set to false, then others that have it set to true, does the
> > locking tool get called when they're set to false?  
> 
> Indeed it does.  There are 2 current things conspiring against you:
> 
> * At the start of each recovery a recovery lock consistency check is
>   done. Unfortunately, this means the recovery lock can't be left unset
>   on nodes that do not have the recmaster capability because then the
>   consistency check would fail.
> 
> * At the end of recovery, if the recovery lock is set, all nodes will
>   attempt to take the recovery lock and will expect to fail (on the
>   leader/master too, since it is being taken from a different process on
>   the leader/master).
> 
>   This is meant to be a sanity check but, to be honest, I'm not sure
>   whether it really adds any value.  A better option might be to only
>   accept recovery-related controls from the current leader/master node,
>   banning any other node that is stupid enough to send such a control.
> 
> I need to think about his more...
> 
> One of the problems is that the ideas of recovery master and recovery
> lock are historical and they are somewhat dated compared to current
> clustering concepts. Recovery master should really be "cluster
leader"
> and the lock should be "cluster lock".  If we clearly change our
> approach in that direction then it makes no sense to check a cluster
> lock at recovery time.
> 
> I have a branch that does the technical (but not documentation) parts
> of switching to cluster leader and lock... but more work is needed
> before this is ready to merge.
> 
> > IF you say the lock tool still gets called in both cases, then the
docs
> > need to be updated, and we on our end need to add a special config
file
> > option to reject lock acquisitions from those nodes that have the CTDB
> > options set to false, permitting only those nodes set to true to
acquire
> > etcd locks.  
> 
> Well, the documentation (ctdb(7) manual page) does say:
> 
>   CTDB does sanity checks to ensure that the recovery lock is held as
>   expected.
> 
> ;-)
> 
> OK, that's pretty weak!
> 
> I'll try to get some of Amitay's time to discuss what we should do
> here...
There are a few possible changes but none of them would really fix
things properly.  We have situation where the recovery lock (which used
to be released at the end of each recovery but is now released on
election loss) is almost a cluster lock, so we really shouldn't be
sanity checking it at the end of recovery.  However, there's currently
no other sane place to check it.
So, as you say, the docs need to be updated.  Do you think it would be
enough to add to the single sentence above from ctdb(7)?  Something
like the following:
  CTDB does some recovery lock sanity checks. At the beginning of
  each recovery each node checks that its recovery lock setting is
  consistent with that of the recovery master.  At the end of each
  recovery each node attempts to take the recovery lock and expects
  this to fail (because the lock is already held by another process).
  These checks are done unconditionally when the recovery lock is
  set, without regard to node capabilities (see CAPABILITIES below).
How's that?
Thanks...
peace & happiness,
martin