Hey Martin,

Thanks for reaching out! Sure thing, here is the ctdb.conf in question:

[legacy]
    # realtime scheduling = true will cause ctdb to fail when docker containers are running
    realtime scheduling = false

[cluster]
    node address = 192.168.45.230
    recovery lock = !/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.samba cephfs.cephfs.meta ctdblock

I did do some messing around and gave it an incorrect IP
(192.168.45.2322), and it did error and stop CTDB per the code (due to
the invalid IP). It just appears that giving a valid IP address doesn't
change the behaviour.

But perhaps it's my understanding of it that is incorrect.

To give a bit more detail, we are using the ingress service from
cephadm, and CTDB on the same nodes. This ingress service utilizes the
sysctl value mentioned, net.ipv4.ip_nonlocal_bind=1.

What is eventually occurring is CTDB crashing due to being unable to
assign the VIP to the interface on the host. Once the value is turned
back to 0, CTDB functions correctly again.

It may just be that there is another completely separate issue we are
running into, but I was just hopeful based on the docs mentioning that
specific value that it may have just been that.

Here are some logs from right before the crash too, if that helps:

2025-10-13T18:41:42.058594-03:00 adm-gw1 ctdbd[479406]: startup event OK - enabling monitoring
2025-10-13T18:41:42.058654-03:00 adm-gw1 ctdbd[479406]: Set runstate to RUNNING (5)
2025-10-13T18:41:42.097671-03:00 adm-gw1 ctdb-recoverd[479490]: Takeover run completed successfully
2025-10-13T18:41:42.730948-03:00 adm-gw1 ctdb-recoverd[479490]: IP 192.168.45.235 incorrectly on an interface
2025-10-13T18:41:42.731016-03:00 adm-gw1 ctdb-recoverd[479490]: Trigger takeoverrun
2025-10-13T18:41:42.731208-03:00 adm-gw1 ctdb-recoverd[479490]: Takeover run starting
2025-10-13T18:41:42.742354-03:00 adm-gw1 ctdb-takeover[479783]: No nodes available to host public IPs yet
2025-10-13T18:41:42.787980-03:00 adm-gw1 ctdb-recoverd[479490]: Takeover run completed successfully
2025-10-13T18:41:43.732009-03:00 adm-gw1 ctdb-recoverd[479490]: IP 192.168.45.235 incorrectly on an interface
2025-10-13T18:41:43.732077-03:00 adm-gw1 ctdb-recoverd[479490]: Trigger takeoverrun
2025-10-13T18:41:43.732294-03:00 adm-gw1 ctdb-recoverd[479490]: Takeover run starting
2025-10-13T18:41:43.745921-03:00 adm-gw1 ctdb-takeover[479794]: No nodes available to host public IPs yet
2025-10-13T18:41:43.796166-03:00 adm-gw1 ctdb-recoverd[479490]: Takeover run completed successfully
2025-10-13T18:41:44.210461-03:00 adm-gw1 ctdbd[479406]: monitor event OK - node re-enabled
2025-10-13T18:41:44.211218-03:00 adm-gw1 ctdbd[479406]: Node became HEALTHY. Ask recovery master to reallocate IPs
2025-10-13T18:41:44.732792-03:00 adm-gw1 ctdb-recoverd[479490]: Unassigned IP 192.168.45.235 can be served by this node
2025-10-13T18:41:44.732964-03:00 adm-gw1 ctdb-recoverd[479490]: IP 192.168.45.235 incorrectly on an interface
2025-10-13T18:41:44.732987-03:00 adm-gw1 ctdb-recoverd[479490]: Trigger takeoverrun
2025-10-13T18:41:44.733160-03:00 adm-gw1 ctdb-recoverd[479490]: Takeover run starting
2025-10-13T18:41:44.769369-03:00 adm-gw1 ctdbd[479406]: ../../ctdb/server/ctdb_takeover.c:797 Doing updateip for IP 192.168.45.235 already on an interface
2025-10-13T18:41:44.769448-03:00 adm-gw1 ctdbd[479406]: Update of IP 192.168.45.235/16 from interface __none__ to ens18
2025-10-13T18:41:44.788619-03:00 adm-gw1 ctdb-eventd[479407]: 10.interface: ERROR: Unable to determine interface for IP 192.168.45.235
2025-10-13T18:41:44.788689-03:00 adm-gw1 ctdb-eventd[479407]: updateip event failed
2025-10-13T18:41:44.788847-03:00 adm-gw1 ctdbd[479406]: Failed update of IP 192.168.45.235 from interface __none__ to ens18
2025-10-13T18:41:44.788945-03:00 adm-gw1 ctdbd[479406]: ==============================================================
2025-10-13T18:41:44.788966-03:00 adm-gw1 ctdbd[479406]: INTERNAL ERROR: Signal 11: Segmentation fault in () () pid 479406 (4.21.3)
2025-10-13T18:41:44.788985-03:00 adm-gw1 ctdbd[479406]: If you are running a recent Samba version, and if you think this problem is not yet fixed in the latest versions, please consider reporting this bug, see https://wiki.samba.org/index.php/Bug_Reporting
2025-10-13T18:41:44.789003-03:00 adm-gw1 ctdbd[479406]: ==============================================================
2025-10-13T18:41:44.789016-03:00 adm-gw1 ctdbd[479406]: PANIC (pid 479406): Signal 11: Segmentation fault in 4.21.3
2025-10-13T18:41:44.789489-03:00 adm-gw1 ctdbd[479406]: BACKTRACE: 21 stack frames:
 #0 /usr/lib64/samba/libgenrand-private-samba.so(log_stack_trace+0x34) [0x7fee28193624]
 #1 /usr/lib64/samba/libgenrand-private-samba.so(smb_panic+0xd) [0x7fee28193e0d]
 #2 /usr/lib64/samba/libgenrand-private-samba.so(+0x2fd8) [0x7fee28193fd8]
 #3 /lib64/libc.so.6(+0x3ebf0) [0x7fee27e3ebf0]
 #4 /usr/sbin/ctdbd(+0x563c7) [0x55e82cb8a3c7]
 #5 /usr/sbin/ctdbd(+0x516a0) [0x55e82cb856a0]
 #6 /usr/sbin/ctdbd(+0x51632) [0x55e82cb85632]
 #7 /usr/sbin/ctdbd(+0x54aef) [0x55e82cb88aef]
 #8 /usr/sbin/ctdbd(+0x21a25) [0x55e82cb55a25]
 #9 /usr/sbin/ctdbd(+0x227c2) [0x55e82cb567c2]
 #10 /lib64/libtevent.so.0(tevent_common_invoke_fd_handler+0x95) [0x7fee2813c4a5]
 #11 /lib64/libtevent.so.0(+0x1055e) [0x7fee2814055e]
 #12 /lib64/libtevent.so.0(+0x782b) [0x7fee2813782b]
 #13 /lib64/libtevent.so.0(_tevent_loop_once+0x98) [0x7fee28139368]
 #14 /lib64/libtevent.so.0(tevent_common_loop_wait+0x1b) [0x7fee2813948b]
 #15 /lib64/libtevent.so.0(+0x789b) [0x7fee2813789b]
 #16 /usr/sbin/ctdbd(ctdb_start_daemon+0x68a) [0x55e82cb6b2ba]
 #17 /usr/sbin/ctdbd(main+0x4fb) [0x55e82cb4a92b]
 #18 /lib64/libc.so.6(+0x295d0) [0x7fee27e295d0]
 #19 /lib64/libc.so.6(__libc_start_main+0x80) [0x7fee27e29680]
 #20 /usr/sbin/ctdbd(_start+0x25) [0x55e82cb4afe5]
2025-10-13T18:41:44.969411-03:00 adm-gw1 ctdb-recoverd[479490]: recovery daemon parent died - exiting
2025-10-13T18:41:44.971113-03:00 adm-gw1 ctdb-eventd[479407]: Received signal 15
2025-10-13T18:41:44.971154-03:00 adm-gw1 ctdb-eventd[479407]: Shutting down

Regards,

Bailey Allison
Service Team Lead
45Drives, Ltd.
866-594-7199 x868

On 2025-10-13 21:11, Martin Schwenke wrote:
> Hi Bailey,
>
> On Mon, 13 Oct 2025 17:58:07 -0300, Bailey Allison via samba
> <samba at lists.samba.org> wrote:
>
>> Anyone have experience using the node address = value in ctdb.conf?
>> Running into the exact issue specified in the docs:
>>
>> node address = IPADDR
>>
>>     IPADDR is the private IP address that ctdbd will bind to.
>>
>>     This option is only required when automatic address detection can
>>     not be used. This can be the case when running multiple ctdbd
>>     daemons/nodes on the same physical host (usually for testing) or
>>     using InfiniBand for the private network. Another unlikely
>>     possibility would be running on a platform with a feature like
>>     Linux's net.ipv4.ip_nonlocal_bind=1 enabled and no usable
>>     getifaddrs(3) implementation (or replacement) available.
>>
>>     Default: CTDB selects the first address from the nodes list that
>>     it can bind to. See also the PRIVATE ADDRESS section in ctdb(7).
>>
>> Specifically the section about net.ipv4.ip_nonlocal_bind=1.
>>
>> When trying to use the node address = IPADDR conf though, it appears
>> nothing is changing. It seems from the logs that it isn't even using
>> the value, and for testing I tried renaming it to a garbage option
>> (node garbage = IPADDR) instead of the proper one, and there was no
>> difference in the logs.
>>
>> Is it possible the parameter has a different name than specified in
>> the docs? I also checked the man page on the system it's installed on
>> and it shows the same name for it.
>>
>> I know the cause of this issue is resolved in 4.22.x samba, but I'm
>> looking to see if it can also be solved without an upgrade.
>
> This feature is regularly used in CTDB's "local daemons" test
> environment, where we run multiple daemons on a single machine.
>
> One very basic question: Are you setting "node address" in the
> [cluster] section of ctdb.conf? For historical reasons, the
> configuration handling doesn't warn about misplaced (or unknown)
> options.
>
> If this can't be explained by it being in an incorrect section, can
> you please share an example of a ctdb.conf file that isn't working as
> expected?
>
> Thanks...
>
> peace & happiness,
> martin
Hi Bailey,

[Oops, this time to the list...]

On Tue, 14 Oct 2025 11:12:03 -0300, Bailey Allison via samba
<samba at lists.samba.org> wrote:

> Thanks for reaching out! Sure thing, here is the ctdb.conf in question:
>
> [legacy]
>     # realtime scheduling = true will cause ctdb to fail when docker containers are running
>     realtime scheduling = false
>
> [cluster]
>     node address = 192.168.45.230
>     recovery lock = !/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.samba cephfs.cephfs.meta ctdblock
>
> I did do some messing around and gave it an incorrect IP
> (192.168.45.2322), and it did error and stop CTDB per the code (due to
> the invalid IP). It just appears that giving a valid IP address doesn't
> change the behaviour.
>
> But perhaps it's my understanding of it that is incorrect.

Right, having read further below, the "node address" option only
changes the behaviour of how the private node address is decided.

Without that option, ctdbd will attempt to bind in turn to each local
IP address in the node list, until it succeeds. The changes in Samba
4.22 added the word "local" to that sentence.

With the "node address" option, the only change is that the specified
IP address is the only one that ctdbd attempts to bind to... for the
private node address...

> To give a bit more detail, we are using the ingress service from
> cephadm, and CTDB on the same nodes. This ingress service utilizes the
> sysctl value mentioned, net.ipv4.ip_nonlocal_bind=1.
>
> What is eventually occurring is CTDB crashing due to being unable to
> assign the VIP to the interface on the host.

... but it doesn't affect handling of VIPs at all.

> Once the value is turned back to 0, CTDB functions correctly again.
>
> It may just be that there is another completely separate issue we are
> running into, but I was just hopeful based on the docs mentioning that
> specific value that it may have just been that.

No, it is a separate issue with no solution in 4.21. Explanation
below...

> Here are some logs from right before the crash too, if that helps:

> [...]
> 2025-10-13T18:41:44.211218-03:00 adm-gw1 ctdbd[479406]: Node became
> HEALTHY. Ask recovery master to reallocate IPs
> 2025-10-13T18:41:44.732792-03:00 adm-gw1 ctdb-recoverd[479490]:
> Unassigned IP 192.168.45.235 can be served by this node
> 2025-10-13T18:41:44.732964-03:00 adm-gw1 ctdb-recoverd[479490]: IP
> 192.168.45.235 incorrectly on an interface

The IP address isn't assigned to this node, but ctdbd uses bind(2) to
check if the IP address is local (assuming ip_nonlocal_bind=0) and it
can bind, so (considering the assumption) the address must be local.

> 2025-10-13T18:41:44.732987-03:00 adm-gw1 ctdb-recoverd[479490]: Trigger
> takeoverrun
> 2025-10-13T18:41:44.733160-03:00 adm-gw1 ctdb-recoverd[479490]: Takeover
> run starting
> 2025-10-13T18:41:44.769369-03:00 adm-gw1 ctdbd[479406]:
> ../../ctdb/server/ctdb_takeover.c:797 Doing updateip for IP
> 192.168.45.235 already on an interface
> 2025-10-13T18:41:44.769448-03:00 adm-gw1 ctdbd[479406]: Update of IP
> 192.168.45.235/16 from interface __none__ to ens18

ctdbd decides that since the address is local, it has to do an
"updateip" instead of a "takeip" to make the intended change.

> 2025-10-13T18:41:44.788619-03:00 adm-gw1 ctdb-eventd[479407]:
> 10.interface: ERROR: Unable to determine interface for IP 192.168.45.235
> 2025-10-13T18:41:44.788689-03:00 adm-gw1 ctdb-eventd[479407]: updateip
> event failed
> 2025-10-13T18:41:44.788847-03:00 adm-gw1 ctdbd[479406]: Failed update of
> IP 192.168.45.235 from interface __none__ to ens18

However, the 10.interface event script can't find an interface with the
IP address assigned, so it fails.

> 2025-10-13T18:41:44.788945-03:00 adm-gw1 ctdbd[479406]:
> ==============================================================
> 2025-10-13T18:41:44.788966-03:00 adm-gw1 ctdbd[479406]: INTERNAL ERROR:
> Signal 11: Segmentation fault in () () pid 479406 (4.21.3)
> 2025-10-13T18:41:44.788985-03:00 adm-gw1 ctdbd[479406]: If you are
> running a recent Samba version, and if you think this problem is not yet
> fixed in the latest versions, please consider reporting this bug, see
> https://wiki.samba.org/index.php/Bug_Reporting
> 2025-10-13T18:41:44.789003-03:00 adm-gw1 ctdbd[479406]:
> ==============================================================
> 2025-10-13T18:41:44.789016-03:00 adm-gw1 ctdbd[479406]: PANIC (pid
> 479406): Signal 11: Segmentation fault in 4.21.3
> 2025-10-13T18:41:44.789489-03:00 adm-gw1 ctdbd[479406]: BACKTRACE: 21
> stack frames:

The stack trace isn't useful but, at a guess, it crashes here:

	/*
	 * All we can do is reset the old interface
	 * and let the next run fix it
	 */
	ctdb_vnn_unassign_iface(ctdb, state->vnn);
	state->vnn->iface = state->old;
	state->vnn->iface->references++;

This is because state->old is NULL.

That bug is still there. However, it should no longer happen on Linux
(and possibly other platforms) in CTDB >= 4.22 because the check for an
IP address no longer (only) depends on bind(2).

There are a few choices for how to fix it:

1. Avoid dereferencing state->vnn->iface when it is NULL.

2. Try to do something clever to avoid the "updateip" - but if we have
   unreliable local IP checking, then we might really want to remove
   that IP.

3. Change the "updateip" logic so that if the old interface is
   "__none__" and the IP address is not on an interface, it doesn't
   fail and bypasses trying to remove the IP.

The easiest fix is to change get_iface_ip_maskbits() in
10.interface.script. In the else, before the call to die(), if
"$_iface_in" = "__none__" then set iface="__none__" and return. Then,
up a level in the "updateip" case, only try to delete_ip_from_iface
"$oiface" ... if "$oiface" != "__none__". It is a bit hacky and ugly,
but it is OK... and completely untested. :-)
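In shell, the shape of that change would be something like the sketch
below. I'm writing it from memory, so treat the variable names, the
exact spot inside get_iface_ip_maskbits(), and the argument list of
delete_ip_from_iface as approximations rather than a patch:

	# (a) In get_iface_ip_maskbits(), just before the existing die()
	#     in the else branch: if the old interface is "__none__",
	#     hand that back to the caller instead of failing.
	if [ "$_iface_in" = "__none__" ] ; then
		iface="__none__"
		return 0
	fi

	# (b) In the "updateip" case, only try to remove the IP from the
	#     old interface if there really was one (arguments shown are
	#     a guess and may not match the script):
	if [ "$oiface" != "__none__" ] ; then
		delete_ip_from_iface "$oiface" "$ip" "$maskbits"
	fi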
(1) isn't enough because it will just loop, retrying. So, it'll have to
be (1) and (3).

Summary:

* You can't do what you want to do in CTDB 4.21. You will need to
  upgrade to CTDB 4.22. Sorry...

  Well, or you could take a fix for (3) above (either my fix, when done
  - or one of your own... which you can submit) and hack that into your
  local copy of 10.interface.script. It will probably work. ;-) If you
  make it work, please feel free to submit it. There may be one last
  bug fix release for 4.21, if I interpret correctly.

* I have a bug to fix, unless you fix it first. If you fix it first
  then there is no use running CI, since it won't exercise this code,
  unless we add a unit test.

  If I do the fix, are you OK with being credited in the commit?

    Reported-by: Bailey Allison <ballison at 45drives.com>

Thanks...

peace & happiness,
martin