Hey Martin,

Thanks for reaching out! Sure thing, here is the ctdb.conf in question:

[legacy]
    # realtime scheduling = true will cause ctdb to fail when docker containers are running
    realtime scheduling = false

[cluster]
    node address = 192.168.45.230
    recovery lock = !/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.samba cephfs.cephfs.meta ctdblock

I did do some messing around and gave it an incorrect IP
(192.168.45.2322), and it did error and stop CTDB per the code (due to
the invalid IP). It just appears that giving a valid IP address doesn't
change the behaviour.

But perhaps it's my understanding of it that is incorrect.

To give a bit more detail, we are using the ingress service from
cephadm, and CTDB on the same nodes. This ingress service utilizes the
sysctl value mentioned, net.ipv4.ip_nonlocal_bind=1.

What is eventually occurring is CTDB crashing due to being unable to
assign the VIP to the interface on the host. Once the value is turned
back to 0, CTDB functions correctly again.

It may just be that there is another completely separate issue we are
running into, but I was just hopeful based on the docs mentioning that
specific value that it may have just been that.

Here are some logs from right before the crash too, if that helps:

2025-10-13T18:41:42.058594-03:00 adm-gw1 ctdbd[479406]: startup event OK - enabling monitoring
2025-10-13T18:41:42.058654-03:00 adm-gw1 ctdbd[479406]: Set runstate to RUNNING (5)
2025-10-13T18:41:42.097671-03:00 adm-gw1 ctdb-recoverd[479490]: Takeover run completed successfully
2025-10-13T18:41:42.730948-03:00 adm-gw1 ctdb-recoverd[479490]: IP 192.168.45.235 incorrectly on an interface
2025-10-13T18:41:42.731016-03:00 adm-gw1 ctdb-recoverd[479490]: Trigger takeoverrun
2025-10-13T18:41:42.731208-03:00 adm-gw1 ctdb-recoverd[479490]: Takeover run starting
2025-10-13T18:41:42.742354-03:00 adm-gw1 ctdb-takeover[479783]: No nodes available to host public IPs yet
2025-10-13T18:41:42.787980-03:00 adm-gw1 ctdb-recoverd[479490]: Takeover run completed successfully
2025-10-13T18:41:43.732009-03:00 adm-gw1 ctdb-recoverd[479490]: IP 192.168.45.235 incorrectly on an interface
2025-10-13T18:41:43.732077-03:00 adm-gw1 ctdb-recoverd[479490]: Trigger takeoverrun
2025-10-13T18:41:43.732294-03:00 adm-gw1 ctdb-recoverd[479490]: Takeover run starting
2025-10-13T18:41:43.745921-03:00 adm-gw1 ctdb-takeover[479794]: No nodes available to host public IPs yet
2025-10-13T18:41:43.796166-03:00 adm-gw1 ctdb-recoverd[479490]: Takeover run completed successfully
2025-10-13T18:41:44.210461-03:00 adm-gw1 ctdbd[479406]: monitor event OK - node re-enabled
2025-10-13T18:41:44.211218-03:00 adm-gw1 ctdbd[479406]: Node became HEALTHY. Ask recovery master to reallocate IPs
2025-10-13T18:41:44.732792-03:00 adm-gw1 ctdb-recoverd[479490]: Unassigned IP 192.168.45.235 can be served by this node
2025-10-13T18:41:44.732964-03:00 adm-gw1 ctdb-recoverd[479490]: IP 192.168.45.235 incorrectly on an interface
2025-10-13T18:41:44.732987-03:00 adm-gw1 ctdb-recoverd[479490]: Trigger takeoverrun
2025-10-13T18:41:44.733160-03:00 adm-gw1 ctdb-recoverd[479490]: Takeover run starting
2025-10-13T18:41:44.769369-03:00 adm-gw1 ctdbd[479406]: ../../ctdb/server/ctdb_takeover.c:797 Doing updateip for IP 192.168.45.235 already on an interface
2025-10-13T18:41:44.769448-03:00 adm-gw1 ctdbd[479406]: Update of IP 192.168.45.235/16 from interface __none__ to ens18
2025-10-13T18:41:44.788619-03:00 adm-gw1 ctdb-eventd[479407]: 10.interface: ERROR: Unable to determine interface for IP 192.168.45.235
2025-10-13T18:41:44.788689-03:00 adm-gw1 ctdb-eventd[479407]: updateip event failed
2025-10-13T18:41:44.788847-03:00 adm-gw1 ctdbd[479406]: Failed update of IP 192.168.45.235 from interface __none__ to ens18
2025-10-13T18:41:44.788945-03:00 adm-gw1 ctdbd[479406]: ==============================================================
2025-10-13T18:41:44.788966-03:00 adm-gw1 ctdbd[479406]: INTERNAL ERROR: Signal 11: Segmentation fault in () () pid 479406 (4.21.3)
2025-10-13T18:41:44.788985-03:00 adm-gw1 ctdbd[479406]: If you are running a recent Samba version, and if you think this problem is not yet fixed in the latest versions, please consider reporting this bug, see https://wiki.samba.org/index.php/Bug_Reporting
2025-10-13T18:41:44.789003-03:00 adm-gw1 ctdbd[479406]: ==============================================================
2025-10-13T18:41:44.789016-03:00 adm-gw1 ctdbd[479406]: PANIC (pid 479406): Signal 11: Segmentation fault in 4.21.3
2025-10-13T18:41:44.789489-03:00 adm-gw1 ctdbd[479406]: BACKTRACE: 21 stack frames:
 #0 /usr/lib64/samba/libgenrand-private-samba.so(log_stack_trace+0x34) [0x7fee28193624]
 #1 /usr/lib64/samba/libgenrand-private-samba.so(smb_panic+0xd) [0x7fee28193e0d]
 #2 /usr/lib64/samba/libgenrand-private-samba.so(+0x2fd8) [0x7fee28193fd8]
 #3 /lib64/libc.so.6(+0x3ebf0) [0x7fee27e3ebf0]
 #4 /usr/sbin/ctdbd(+0x563c7) [0x55e82cb8a3c7]
 #5 /usr/sbin/ctdbd(+0x516a0) [0x55e82cb856a0]
 #6 /usr/sbin/ctdbd(+0x51632) [0x55e82cb85632]
 #7 /usr/sbin/ctdbd(+0x54aef) [0x55e82cb88aef]
 #8 /usr/sbin/ctdbd(+0x21a25) [0x55e82cb55a25]
 #9 /usr/sbin/ctdbd(+0x227c2) [0x55e82cb567c2]
 #10 /lib64/libtevent.so.0(tevent_common_invoke_fd_handler+0x95) [0x7fee2813c4a5]
 #11 /lib64/libtevent.so.0(+0x1055e) [0x7fee2814055e]
 #12 /lib64/libtevent.so.0(+0x782b) [0x7fee2813782b]
 #13 /lib64/libtevent.so.0(_tevent_loop_once+0x98) [0x7fee28139368]
 #14 /lib64/libtevent.so.0(tevent_common_loop_wait+0x1b) [0x7fee2813948b]
 #15 /lib64/libtevent.so.0(+0x789b) [0x7fee2813789b]
 #16 /usr/sbin/ctdbd(ctdb_start_daemon+0x68a) [0x55e82cb6b2ba]
 #17 /usr/sbin/ctdbd(main+0x4fb) [0x55e82cb4a92b]
 #18 /lib64/libc.so.6(+0x295d0) [0x7fee27e295d0]
 #19 /lib64/libc.so.6(__libc_start_main+0x80) [0x7fee27e29680]
 #20 /usr/sbin/ctdbd(_start+0x25) [0x55e82cb4afe5]
2025-10-13T18:41:44.969411-03:00 adm-gw1 ctdb-recoverd[479490]: recovery daemon parent died - exiting
2025-10-13T18:41:44.971113-03:00 adm-gw1 ctdb-eventd[479407]: Received signal 15
2025-10-13T18:41:44.971154-03:00 adm-gw1 ctdb-eventd[479407]: Shutting down

Regards,

Bailey Allison
Service Team Lead
45Drives, Ltd.
866-594-7199 x868

On 2025-10-13 21:11, Martin Schwenke wrote:
> Hi Bailey,
>
> On Mon, 13 Oct 2025 17:58:07 -0300, Bailey Allison via samba
> <samba at lists.samba.org> wrote:
>
>> Anyone have experience using the node address = value in ctdb.conf?
>> Running into the exact issue specified in the docs:
>>
>> node address = IPADDR
>>
>>     IPADDR is the private IP address that ctdbd will bind to.
>>
>>     This option is only required when automatic address detection can
>>     not be used. This can be the case when running multiple ctdbd
>>     daemons/nodes on the same physical host (usually for testing) or
>>     using InfiniBand for the private network. Another unlikely
>>     possibility would be running on a platform with a feature like
>>     Linux's net.ipv4.ip_nonlocal_bind=1 enabled and no usable
>>     getifaddrs(3) implementation (or replacement) available.
>>
>>     Default: CTDB selects the first address from the nodes list that
>>     it can bind to. See also the PRIVATE ADDRESS section in ctdb(7).
>>
>> Specifically the section about net.ipv4.ip_nonlocal_bind=1.
>>
>> When trying to use the node address = IPADDR conf though, it appears
>> nothing is changing. It seems from the logs that it isn't even using
>> the value, and for testing I tried renaming it to a garbage option
>> (node garbage = IPADDR) instead of the proper one, and there was no
>> difference in the logs.
>>
>> Is it possible the parameter has a different name than specified in
>> the docs? I also checked the man page on the system it's installed on
>> and it shows the same name for it.
>>
>> I know the cause of this issue is resolved in 4.22.x samba, but I'm
>> looking to see if it can also be solved without an upgrade.
>
> This feature is regularly used in CTDB's "local daemons" test
> environment, where we run multiple daemons on a single machine.
>
> One very basic question: Are you setting "node address" in the
> [cluster] section of ctdb.conf? For historical reasons, the
> configuration handling doesn't warn about misplaced (or unknown)
> options.
>
> If this can't be explained by it being in an incorrect section, can
> you please share an example of a ctdb.conf file that isn't working as
> expected?
>
> Thanks...
>
> peace & happiness,
> martin
Hi Bailey,

[Oops, this time to the list...]

On Tue, 14 Oct 2025 11:12:03 -0300, Bailey Allison via samba
<samba at lists.samba.org> wrote:

> Thanks for reaching out! Sure thing, here is the ctdb.conf in question:
>
> [legacy]
>     # realtime scheduling = true will cause ctdb to fail when docker containers are running
>     realtime scheduling = false
>
> [cluster]
>     node address = 192.168.45.230
>     recovery lock = !/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.samba cephfs.cephfs.meta ctdblock
>
> I did do some messing around and gave it an incorrect IP
> (192.168.45.2322), and it did error and stop CTDB per the code (due to
> the invalid IP). It just appears that giving a valid IP address doesn't
> change the behaviour.
>
> But perhaps it's my understanding of it that is incorrect.

Right, having read further below, the "node address" option only
changes the behaviour of how the private node address is decided.

Without that option, ctdbd will attempt to bind in turn to each local
IP address in the node list, until it succeeds. The changes in Samba
4.22 added the word "local" to that sentence.

With the "node address" option, the only change is that the specified
IP address is the only one that ctdbd attempts to bind to... for the
private node address...

> To give a bit more detail, we are using the ingress service from
> cephadm, and CTDB on the same nodes. This ingress service utilizes the
> sysctl value mentioned, net.ipv4.ip_nonlocal_bind=1.
>
> What is eventually occurring is CTDB crashing due to being unable to
> assign the VIP to the interface on the host.

... but it doesn't affect handling of VIPs at all.

> Once the value is turned back to 0, CTDB functions correctly again.
>
> It may just be that there is another completely separate issue we are
> running into, but I was just hopeful based on the docs mentioning that
> specific value that it may have just been that.

No, it is a separate issue with no solution in 4.21. Explanation
below...

> Here are some logs from right before the crash too, if that helps:

> [...]
> 2025-10-13T18:41:44.211218-03:00 adm-gw1 ctdbd[479406]: Node became
> HEALTHY. Ask recovery master to reallocate IPs
> 2025-10-13T18:41:44.732792-03:00 adm-gw1 ctdb-recoverd[479490]:
> Unassigned IP 192.168.45.235 can be served by this node
> 2025-10-13T18:41:44.732964-03:00 adm-gw1 ctdb-recoverd[479490]: IP
> 192.168.45.235 incorrectly on an interface

The IP address isn't assigned to this node, but ctdbd uses bind(2) to
check if the IP address is local (assuming ip_nonlocal_bind=0) and it
can bind, so (considering the assumption) the address must be local.

> 2025-10-13T18:41:44.732987-03:00 adm-gw1 ctdb-recoverd[479490]: Trigger
> takeoverrun
> 2025-10-13T18:41:44.733160-03:00 adm-gw1 ctdb-recoverd[479490]: Takeover
> run starting
> 2025-10-13T18:41:44.769369-03:00 adm-gw1 ctdbd[479406]:
> ../../ctdb/server/ctdb_takeover.c:797 Doing updateip for IP
> 192.168.45.235 already on an interface
> 2025-10-13T18:41:44.769448-03:00 adm-gw1 ctdbd[479406]: Update of IP
> 192.168.45.235/16 from interface __none__ to ens18

ctdbd decides that since the address is local, it has to do an
"updateip" instead of a "takeip" to make the intended change.

> 2025-10-13T18:41:44.788619-03:00 adm-gw1 ctdb-eventd[479407]:
> 10.interface: ERROR: Unable to determine interface for IP 192.168.45.235
> 2025-10-13T18:41:44.788689-03:00 adm-gw1 ctdb-eventd[479407]: updateip
> event failed
> 2025-10-13T18:41:44.788847-03:00 adm-gw1 ctdbd[479406]: Failed update of
> IP 192.168.45.235 from interface __none__ to ens18

However, the 10.interface event script can't find an interface with the
IP address assigned, so it fails.

> 2025-10-13T18:41:44.788945-03:00 adm-gw1 ctdbd[479406]:
> ==============================================================
> 2025-10-13T18:41:44.788966-03:00 adm-gw1 ctdbd[479406]: INTERNAL ERROR:
> Signal 11: Segmentation fault in () () pid 479406 (4.21.3)
> 2025-10-13T18:41:44.788985-03:00 adm-gw1 ctdbd[479406]: If you are
> running a recent Samba version, and if you think this problem is not yet
> fixed in the latest versions, please consider reporting this bug, see
> https://wiki.samba.org/index.php/Bug_Reporting
> 2025-10-13T18:41:44.789003-03:00 adm-gw1 ctdbd[479406]:
> ==============================================================
> 2025-10-13T18:41:44.789016-03:00 adm-gw1 ctdbd[479406]: PANIC (pid
> 479406): Signal 11: Segmentation fault in 4.21.3
> 2025-10-13T18:41:44.789489-03:00 adm-gw1 ctdbd[479406]: BACKTRACE: 21
> stack frames:

The stack trace isn't useful but, at a guess, it crashes here:

	/*
	 * All we can do is reset the old interface
	 * and let the next run fix it
	 */
	ctdb_vnn_unassign_iface(ctdb, state->vnn);
	state->vnn->iface = state->old;
	state->vnn->iface->references++;

This is because state->old is NULL.

That bug is still there. However, it should no longer happen on Linux
(and possibly other platforms) in CTDB >= 4.22 because the check for an
IP address no longer (only) depends on bind(2).

There are a few choices for how to fix it:

1. Avoid dereferencing state->vnn->iface when it is NULL.

2. Try to do something clever to avoid the "updateip" - but if we have
   unreliable local IP checking, then we might really want to remove
   that IP.

3. Change the "updateip" logic so that if the old interface is
   "__none__" and the IP address is not on an interface, it doesn't
   fail and bypasses trying to remove the IP.

The easiest fix is to change get_iface_ip_maskbits() in
10.interface.script. In the else, before the call to die(), if
"$_iface_in" = "__none__" then set iface="__none__" and return. Then,
up a level in the "updateip" case, only try to delete_ip_from_iface
"$oiface" ... if "$oiface" != "__none__". It is a bit hacky and ugly,
but it is OK... and completely untested. :-)
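In shell, the shape of that change would be something like the sketch
below. I'm writing it from memory, so treat the variable names, the
exact spot inside get_iface_ip_maskbits(), and the argument list of
delete_ip_from_iface as approximations rather than a patch:

	# (a) In get_iface_ip_maskbits(), just before the existing die()
	#     in the else branch: if the old interface is "__none__",
	#     hand that back to the caller instead of failing.
	if [ "$_iface_in" = "__none__" ] ; then
		iface="__none__"
		return 0
	fi

	# (b) In the "updateip" case, only try to remove the IP from the
	#     old interface if there really was one (arguments shown are
	#     a guess and may not match the script):
	if [ "$oiface" != "__none__" ] ; then
		delete_ip_from_iface "$oiface" "$ip" "$maskbits"
	fi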
(1) isn't enough because it will just loop, retrying. So, it'll have to
be (1) and (3).

Summary:

* You can't do what you want to do in CTDB 4.21. You will need to
  upgrade to CTDB 4.22. Sorry...

  Well, or you could take a fix for (3) above (either my fix, when done
  - or one of your own... which you can submit) and hack that into your
  local copy of 10.interface.script. It will probably work. ;-) If you
  make it work, please feel free to submit it. There may be one last
  bug fix release for 4.21, if I interpret correctly.

* I have a bug to fix, unless you fix it first. If you fix it first
  then there is no use running CI, since it won't exercise this code,
  unless we add a unit test.

  If I do the fix, are you OK with being credited in the commit?

    Reported-by: Bailey Allison <ballison at 45drives.com>

Thanks...

peace & happiness,
martin