Hi Rowland,

On 18/07/19 15:52, Rowland penny via samba wrote:
> my plan would be to:
>
> TURN OFF DC2

I did it on Friday afternoon after my numerous attempts to demote DC2
failed.
This fixed one issue - it made the network shares appear again across
all clients.
A new one has been discovered though on one of our CentOS 5.11 boxes.
Any command (like sudo or ssh) that needs authentication or user name
lookup takes a long time to complete.
This not only makes working with this machine very difficult but also
causes lots of complex scripts to fail due to timeouts.

Even though DC2 (192.168.8.125) has been powered off for almost 3 days
I can still see this client trying to connect to it when I ssh from
another terminal:

[root@centos log]# lsof | grep 192.168.8.125
sshd  6630  root  7u  IPv4  24776  0t0  TCP centos.company.co.uk:57423->192.168.8.125:ldap (SYN_SENT)
sshd  6642  root  7u  IPv4  24812  0t0  TCP centos.company.co.uk:57425->192.168.8.125:ldap (SYN_SENT)

At the same time I can see a lot of healthy TCP connection states
(ESTABLISHED, CLOSE_WAIT) against DC1.
Since no configuration changes have been made on this CentOS box I'm
assuming it must be DC1 advertising DC2 to clients.
Is removing references to DC2 from DC1 the only option to resolve it
or are there any quick tricks available to try?
E.g. some cache still needs to expire or needs to be forced to do so.

> Remove any trace of DC2 from DC1

I'm assuming I need to try exactly the same thing as last time?

ldbedit -e vim -H /var/lib/samba/private/sam.ldb --cross-ncs

Is there any difference between running it with samba running vs
samba stopped?
Apart from DDNS updates there should be no modifications made to AD
during the edit process (e.g. no machines or users added or removed,
no passwords changed etc.).

> Run 'samba-tool dbcheck --fix --yes --cross-ncs'
>
> Hopefully this will fix DC1, but your Samba is that old, I cannot
> remember if that will run on your DC.
>
> Your main problem is that your DC is in production, that is why I said
> to back everything up before you start.

I've skimmed through:
https://wiki.samba.org/index.php/Back_up_and_Restoring_a_Samba_AD_DC
and my understanding is that both online and offline samba-tool backups
are only available in the very latest versions, 4.9 and 4.10.
So the only option I have is a manual data backup.
Is it sufficient to back up the /var/lib/samba folder (containing *.ldb,
sysvol and netlogon) and restore it entirely if disaster strikes?
Is there any benefit in stopping samba before creating a tarball?

Thanks,
Adam
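
For reference, a minimal sketch of the manual, offline tarball backup asked about above. The helper name backup_tree and the output path are hypothetical, the service names are the ones mentioned elsewhere in this thread, and the use of "service ... stop" is an assumption about the init system. It has not been tested against a Samba install this old, and it does not answer whether /var/lib/samba alone is sufficient.

```shell
#!/bin/sh
# Hypothetical helper: archive one directory tree into a gzip'd tarball.
# Nothing in the function itself is Samba-specific.
backup_tree() {
    src=$1   # directory to archive, e.g. /var/lib/samba
    out=$2   # tarball to create
    # -C switches to the parent directory first, so the archive stores
    # relative paths (samba/...) rather than absolute ones.
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")"
}

# Intended use on the DC, as root (assumed commands, adjust to the host):
#   service bind9 stop
#   service sernet-samba-ad stop
#   backup_tree /var/lib/samba "/root/samba-$(date +%F).tar.gz"
#   service sernet-samba-ad start
#   service bind9 start
#
# Note: plain tar records modes and ownership but not extended attributes
# or POSIX ACLs. GNU tar has --xattrs and --acls for that; whether the tar
# shipped on this box supports them is not known here.
```

Stopping the services first means the database files cannot change while tar is reading them, which is the usual reason for taking a file-level copy offline.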
On 22/07/2019 12:41, Adam Weremczuk via samba wrote:
> Hi Rowland,
>
> On 18/07/19 15:52, Rowland penny via samba wrote:
>> my plan would be to:
>>
>> TURN OFF DC2
> I did it on Friday afternoon after my numerous attempts to demote DC2
> failed.
> This fixed one issue - it made the network shares appear again across
> all clients.
> A new one has been discovered though on one of our CentOS 5.11 boxes.

I am beginning to think we should rename this thread to 'messy network'
;-)

I do hope you have 'I must upgrade the dead OS CentOS 5' on your to-do
list.

> Any command (like sudo or ssh) that needs authentication or user name
> lookup takes a long time to complete.
> This not only makes working with this machine very difficult but
> also causes lots of complex scripts to fail due to timeouts.
>
> Even though DC2 (192.168.8.125) has been powered off for almost 3 days
> I can still see this client trying to connect to it when I ssh from
> another terminal:
>
> [root@centos log]# lsof | grep 192.168.8.125
> sshd  6630  root  7u  IPv4  24776  0t0  TCP centos.company.co.uk:57423->192.168.8.125:ldap (SYN_SENT)
> sshd  6642  root  7u  IPv4  24812  0t0  TCP centos.company.co.uk:57425->192.168.8.125:ldap (SYN_SENT)
>
> At the same time I can see a lot of healthy TCP connection states
> (ESTABLISHED, CLOSE_WAIT) against DC1.
> Since no configuration changes have been made on this CentOS box I'm
> assuming it must be DC1 advertising DC2 to clients.

If DC2 is still in the database, it will be.

> Is removing references to DC2 from DC1 the only option to resolve it
> or are there any quick tricks available to try?
> E.g. some cache still needs to expire or needs to be forced to do so.

You could try restarting Samba, this should recreate any caches, but I
think you will need to remove DC2. There are two ways of doing this:
manually with ldbdel etc., or by climbing through the Samba versions
until you get to a point where you can back up everything and run the
demote with '--remove-other-dead-server'.

Rowland

>> Remove any trace of DC2 from DC1
> I'm assuming I need to try exactly the same thing as last time?
>
> ldbedit -e vim -H /var/lib/samba/private/sam.ldb --cross-ncs
>
> Is there any difference between running it with samba running vs
> samba stopped?
> Apart from DDNS updates there should be no modifications made to AD
> during the edit process (e.g. no machines or users added or removed,
> no passwords changed etc.).
>>
>> Run 'samba-tool dbcheck --fix --yes --cross-ncs'
>>
>> Hopefully this will fix DC1, but your Samba is that old, I cannot
>> remember if that will run on your DC.
>>
>> Your main problem is that your DC is in production, that is why I
>> said to back everything up before you start.
> I've skimmed through:
> https://wiki.samba.org/index.php/Back_up_and_Restoring_a_Samba_AD_DC
> and my understanding is that both online and offline samba-tool backups
> are only available in the very latest versions, 4.9 and 4.10.
> So the only option I have is a manual data backup.
> Is it sufficient to back up the /var/lib/samba folder (containing *.ldb,
> sysvol and netlogon) and restore it entirely if disaster strikes?
> Is there any benefit in stopping samba before creating a tarball?
>
> Thanks,
> Adam
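
To check whether a Samba restart or the later cleanup actually stops clients from chasing the dead DC, the stuck connections quoted above can be counted with a small filter. This is only a sketch: count_syn_sent is a hypothetical helper name, and it simply reads lsof-style lines on stdin.

```shell
#!/bin/sh
# Hypothetical helper: count lsof-style lines that show a connection
# to the given IP still waiting in SYN_SENT.
count_syn_sent() {
    ip=$1
    # index() does a plain substring match, so the dots in the address
    # are not treated as regex wildcards. "n + 0" prints 0 when nothing
    # matched instead of an empty line.
    awk -v ip="$ip" '
        index($0, "->" ip ":") && index($0, "(SYN_SENT)") { n++ }
        END { print n + 0 }
    '
}

# Possible use on the client, as root:
#   lsof -nP -i TCP | count_syn_sent 192.168.8.125
# -n and -P keep lsof from resolving addresses and port names, so the
# peer shows up as the literal IP the filter looks for.
```

A count that drops to 0 and stays there after the change suggests the client is no longer being handed DC2's address.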
Hi Rowland,

I've decided to roll back samba on DC1 to the state from a couple of
weeks ago, before I started all this mess... Since the email subject
change :)

Stopped bind9 and sernet-samba-ad and copied /var/lib/samba aside.
Restored the samba folder from backup and started sernet-samba-ad, but
bind9 fails to start:

Jul 22 14:39:39 dc1 named[27846]: generating session key for dynamic DNS
Jul 22 14:39:39 dc1 named[27846]: sizing zone task pool based on 5 zones
Jul 22 14:39:39 dc1 named[27846]: Loading 'AD DNS Zone' using driver dlopen
Jul 22 14:39:39 dc1 named[27846]: samba_dlz: Failed to connect to /var/lib/samba/private/dns/sam.ldb
Jul 22 14:39:39 dc1 named[27846]: dlz_dlopen of 'AD DNS Zone' failed
Jul 22 14:39:39 dc1 named[27846]: SDLZ driver failed to load.
Jul 22 14:39:39 dc1 named[27846]: DLZ driver failed to load.
Jul 22 14:39:39 dc1 named[27846]: loading configuration: failure
Jul 22 14:39:39 dc1 named[27846]: exiting (due to fatal error)

Initially I suspected a permissions / ownership issue, but the current
and the backup copy look identical:

dc1:/# getfacl var/lib/samba/private/dns/sam.ldb
# file: var/lib/samba/private/dns/sam.ldb
# owner: root
# group: bind
user::rw-
group::rw-
other::---

dc1:/# getfacl var/tmp/bacula-restores/var/lib/samba/private/dns/sam.ldb
# file: var/tmp/bacula-restores/var/lib/samba/private/dns/sam.ldb
# owner: root
# group: bind
user::rw-
group::rw-
other::---

The files have the same size and time stamps, both last modified in 2013.
There is also no difference in ownership and permissions for the parent
samba/private/dns folders.

After rolling back /var/lib/samba and restarting services, DNS and AD
are working again.

Any ideas?

Thanks,
Adam
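
Matching size and time stamps do not prove that two copies are byte-identical, so a content comparison may be worth adding to the getfacl check above. A minimal sketch, assuming GNU coreutils (stat -c); compare_files is a hypothetical helper name and this is not a diagnosis of the bind9 failure.

```shell
#!/bin/sh
# Hypothetical helper: report whether two files differ in content or in
# owner/group/mode. Assumes GNU stat (the -c format option).
compare_files() {
    a=$1
    b=$2
    # cmp -s prints nothing and returns non-zero if the bytes differ.
    if cmp -s "$a" "$b"; then
        echo "content: same"
    else
        echo "content: DIFFERENT"
    fi
    # %U = owner name, %G = group name, %a = permission bits in octal.
    if [ "$(stat -c '%U:%G %a' "$a")" = "$(stat -c '%U:%G %a' "$b")" ]; then
        echo "owner/mode: same"
    else
        echo "owner/mode: DIFFERENT"
    fi
}

# Possible use, with the two paths from this message:
#   compare_files /var/lib/samba/private/dns/sam.ldb \
#       /var/tmp/bacula-restores/var/lib/samba/private/dns/sam.ldb
```

The same helper can be looped over the other files under private/dns to see whether anything besides sam.ldb differs between the live tree and the restored one.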
On 22/07/19 13:01, Rowland penny via samba wrote:
> You could try restarting Samba, this should recreate any caches, but I
> think you will need to remove DC2. There are two ways of doing this:
> manually with ldbdel etc., or by climbing through the Samba versions
> until you get to a point where you can back up everything and run the
> demote with '--remove-other-dead-server'.
>
> Rowland

My restore has been successful and I'm back to square one - a clean
single DC not trying to replicate anywhere.

The only place where I can still see DC2 being referred to is Active
Directory Users and Computers -> Domain Controllers.
When I try to delete it from the list I'm presented with the attached
prompt.

Is it safe to proceed and try using Windows AD tools to complete the
cleanup?