Sage Weil
2024-May-03 21:17 UTC
[Samba] samba failover with ctdb and client-visible errors
Hi everyone,

I'm setting up a clustered Samba+CTDB in front of CephFS and am running into an issue during failover. For the most part everything seems to work: the IP moves quickly, smbd is started on the right node, etc, but if there is an IO load from a client during failover (e.g., copying a big directory full of files in File Explorer), it pauses for a couple of seconds and then pops up an error dialog box. If I hit 'Try Again' everything continues without problems. However... I assume that a client-visible error like this will cause problems with most applications (that may not be persistent enough to retry everything). I did a google search and the only thing I found was something suggesting passing a flag to xcopy that forces a retry on error.

Here's what the dialog looks like when I reboot one of the gateway nodes:

https://i.ibb.co/kh4fFPW/tryagain.png

If I click 'Try Again' everything proceeds.

Here's my smb.conf:

root@smbgw2:/etc/samba# cat smb.conf
[global]
    clustering = yes
    include = registry
root@smbgw2:/etc/samba# net conf list
[global]
    netbios name = smbgw
    clustering = yes
    idmap config * : backend = tdb2
    passdb backend = tdbsam
    load printers = no
    smbd: backgroundqueue = no

[Audio]
    path = /mnt/audio
    read only = no
    oplocks = no
    kernel share modes = no

CTDB config looks like so:

# See ctdb.conf(5) for documentation
#
# See ctdb-script.options(5) for documentation about event script
# options

[logging]
    # Enable logging to syslog
    location = syslog

    # Default log level
    log level = NOTICE

[cluster]
    # Shared recovery lock file to avoid split brain. Daemon
    # default is no recovery lock. Do NOT run CTDB without a
    # recovery lock file unless you know exactly what you are
    # doing.
    #
    # Please see the RECOVERY LOCK section in ctdb(7) for more
    # details.
    #
    # recovery lock = !/bin/false RECOVERY LOCK NOT CONFIGURED
    recovery lock = /mnt/audio/.ctdb/recovery_lock

^ /mnt/audio is the CephFS mount I am reexporting.

CTDB has a single IP in public_addresses that is moving around between the gateway nodes as expected--from what I can tell that is all working well.

The only other issue I've identified is that I seem to have to create the user (and set the password with smbpasswd) on each of the gateways... even though I expected that the 'passdb backend = tdbsam' line would keep user and password info in ctdb somewhere. Am I missing something there?

Thanks!
sage
Martin Schwenke
2024-May-04 02:05 UTC
[Samba] samba failover with ctdb and client-visible errors
Hi Sage,

On Fri, 3 May 2024 16:17:45 -0500, Sage Weil via samba
<samba at lists.samba.org> wrote:

> I'm setting up a clustered Samba+CTDB in front of CephFS and am
> running into an issue during failover. For the most part everything
> seems to work: the IP moves quickly, smbd is started on the right
> node, etc, but if there is an IO load from a client during failover
> (e.g., copying a big directory full of files in File Explorer), it
> pauses for a couple of seconds and then pops up an error dialog box.
> If I hit 'Try Again' everything continues without problems.
> However... I assume that a client-visible error like this will cause
> problems with most applications (that may not be persistent enough to
> retry everything). I did a google search and the only thing I found
> was something suggesting passing a flag to xcopy that forces a retry
> on error.
>
> Here's what the dialog looks like when I reboot one of the gateway nodes:
> https://i.ibb.co/kh4fFPW/tryagain.png
> If I click 'Try Again' everything proceeds.

Error handling seems to be application-dependent on Windows. If you're doing lots of copying then the hint you found for xcopy is probably a good idea. Many applications will silently reconnect.

One issue is that CTDB's failover is done at the TCP networking level, so it is impossible to hide errors from applications. The dream is to get transparent failover with Microsoft's Witness Protocol (available in Samba ≥ 4.20) and persistent file handles (not yet in Samba).

> Here's my smb.conf:
>
> root@smbgw2:/etc/samba# cat smb.conf
> [global]
> clustering = yes
> include = registry
> root@smbgw2:/etc/samba# net conf list
> [global]
> netbios name = smbgw
> clustering = yes
> idmap config * : backend = tdb2

For default domain ID mapping, you probably want autorid these days:

https://www.samba.org/samba/docs/current/man-html/idmap_autorid.8.html

> [...]
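
For illustration only, switching the default domain from tdb2 to autorid might look like the fragment below. This is a sketch, not a tested configuration for this cluster: the range is the illustrative value from the idmap_autorid(8) examples and would need to be chosen to suit the site. Because this setup uses 'include = registry', the parameters would go into the registry configuration rather than into smb.conf itself.

```
[global]
    # Assumption: range taken from the idmap_autorid(8) example, adjust as needed
    idmap config * : backend = autorid
    idmap config * : range = 1000000-1999999
```

Changing the ID mapping backend on a share that already holds files can change the UIDs and GIDs that users map to, so this is easiest to do before any data is written.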
> CTDB config looks like so:
>
> CTDB has a single IP in public_addresses that is moving around between
> the gateway nodes as expected--from what I can tell that is all
> working well.

If CephFS is sane (i.e. has proper locking coherency - others will be able to make better comments about this) then clustered Samba can happily be active-active, so you can have multiple IPs in public_addresses, and multiple clients can access via different gateway nodes in parallel.

> The only other issue I've identified is that I seem to have to create
> the user (and set the password with smbpasswd) on each of the
> gateways... even though I expected that the 'passdb backend = tdbsam'
> line would keep user and password info in ctdb somewhere. Am I
> missing something there?

There currently isn't a way of exposing local users at the OS level, and an OS user is needed for file permissions. We have thought of faking this via winbind, but it keeps sliding down the priority queue. Setting up a Samba Active Directory server isn't especially difficult, so tends to be a good option.

I hope some of that is useful... :-)

peace & happiness,
martin
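
On the point above that error handling is application-dependent: for scripted copies onto the share, the application itself can do what 'Try Again' does in File Explorer. The following is a minimal sketch of that idea in Python, not anything provided by Samba or CTDB. The function name copy_with_retry and its parameters (attempts, delay, copy, sleep) are hypothetical names chosen for this example, and it assumes a failed connection surfaces to the application as an OSError.

```python
import shutil
import time


def copy_with_retry(src, dst, attempts=5, delay=2.0,
                    copy=shutil.copyfile, sleep=time.sleep):
    """Copy src to dst, retrying when the copy raises OSError.

    Hypothetical helper: assumes a dropped SMB connection during
    failover shows up as an OSError in the calling application.
    """
    for attempt in range(1, attempts + 1):
        try:
            # Each retry restarts the whole file; nothing is resumed.
            return copy(src, dst)
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(delay)  # wait for the public IP to move to another node
```

Retrying is reasonable here because copying a whole file is idempotent. The same approach is not automatically safe for operations such as appends or renames, where a request may have been applied before the connection dropped.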