Hello,

I'm running Samba 4.5.0 and bind-9.8.2-0.47.rc1.el6_8.1. One DC of four, the PDC, is orders of magnitude slower running /usr/local/samba/sbin/samba_dnsupdate --verbose --all-names, and while that command runs on that DC it seems to block any queries. The load average is usually under 0.5. The DC was unsafely halted at one point, which could have corrupted something, but a dbcheck with samba-tool came back clean other than the expected cleanup after upgrading to 4.5.0. Are there any caches or similar that I could try clearing for BIND?

At least once a day, memory usage also grows from the typical ~1 GB to everything the box has (8 GB physical and 10 GB swap), requiring a forceful restart, so there appears to be a memory leak as well. When memory usage is high, it comes from the smbd processes, which I wouldn't expect to have any correlation with BIND. Or could it be that, rather than a memory leak, the blocking seen with DNS queries is also blocking SMB clients, resulting in a pile-up of connections and high memory usage? The load under this condition is very high, but that is due to heavy I/O and CPU usage from swapping. I had similar behavior with 4.4.5, though it was fine for the first couple of weeks after the upgrade.

Thanks,
Arthur

This e-mail and any attachments may contain CONFIDENTIAL information, including PROTECTED HEALTH INFORMATION. If you are not the intended recipient, any use or disclosure of this information is STRICTLY PROHIBITED; you are requested to delete this e-mail and any attachments, notify the sender immediately, and notify the Mediture Privacy Officer at privacyofficer@mediture.com.
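[Editor's note: whether the growth comes from smbd or from named can be confirmed by sampling per-daemon memory while the problem develops. A minimal sketch, assuming a procps-style `ps` that supports `-C`; the sampling loop in the comment is only illustrative:]

```shell
#!/bin/sh
# Total resident memory (KB) of every process with a given name,
# so smbd growth can be compared against named growth over time.
rss_total_kb() {
    # `ps -C name -o rss=` prints one RSS value per matching process,
    # with no header; awk sums the column (0 if no process matches).
    ps -C "$1" -o rss= | awk '{ sum += $1 } END { print sum + 0 }'
}

# Sample both daemons once a minute, e.g. from cron or a loop:
# echo "$(date '+%F %T') smbd=$(rss_total_kb smbd)KB named=$(rss_total_kb named)KB"
```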
I'm hoping the issue is just load balancing, but I'm not sure. I can't seem to get the traffic balanced across the two DCs.

I ran this script on all Linux nodes to balance the traffic:

#!/usr/bin/perl
use strict;
use warnings;

my $primary_name_server;
my $random = int(rand(10));

open(my $resolv_conf_fh, '< /etc/resolv.conf') or die("Unable to open /etc/resolv.conf for reading: $!");
while (<$resolv_conf_fh>) {
    chomp;
    if ($_ =~ /nameserver (.*)/) {
        $primary_name_server = $1;
        last;
    }
}
close($resolv_conf_fh);

if (!defined($primary_name_server) || $primary_name_server eq '192.168.168.64' || $primary_name_server eq '192.168.168.65') {
    open(my $resolv_conf_fh, '> /etc/resolv.conf') or die("Unable to open /etc/resolv.conf for writing: $!");
    print $resolv_conf_fh "search mediture.dom\n";
    print $resolv_conf_fh "options rotate timeout:1\n";
    if ($random >= 4) {
        print $resolv_conf_fh "nameserver 192.168.168.64\n";
        print $resolv_conf_fh "nameserver 192.168.168.65\n";
    } else {
        print $resolv_conf_fh "nameserver 192.168.168.65\n";
        print $resolv_conf_fh "nameserver 192.168.168.64\n";
    }
    close($resolv_conf_fh);

    if (-f '/usr/bin/wbinfo') {
        open(my $krb5_conf_fh, '> /etc/krb5.conf') or die("Unable to open /etc/krb5.conf for writing: $!");
        print $krb5_conf_fh q([logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log
 default_realm = MEDITURE.DOM

[libdefaults]
 default_realm = MEDITURE.DOM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true
 default_keytab_name = FILE:/etc/krb5.keytab

[realms]
 MEDITURE.DOM = {);
        if ($random >= 4) {
            print $krb5_conf_fh "  kdc = dc01.mediture.dom\n";
            print $krb5_conf_fh "  kdc = dc03.mediture.dom\n";
            print $krb5_conf_fh "  kdc = dc02.mediture.dom\n";
            print $krb5_conf_fh "  kdc = dc04.mediture.dom\n";
        } else {
            print $krb5_conf_fh "  kdc = dc03.mediture.dom\n";
            print $krb5_conf_fh "  kdc = dc01.mediture.dom\n";
            print $krb5_conf_fh "  kdc = dc04.mediture.dom\n";
            print $krb5_conf_fh "  kdc = dc02.mediture.dom\n";
        }
        print $krb5_conf_fh q( default_realm = MEDITURE.DOM
 }

[domain_realm]
 mediture.dom = MEDITURE.DOM
 .mediture.dom = MEDITURE.DOM);
        close($krb5_conf_fh);

        open(my $smb_conf_fh, '> /etc/samba/smb.conf') or die("Unable to open /etc/samba/smb.conf for writing: $!");
        print $smb_conf_fh q([global]
#--authconfig--start-line--
workgroup = MEDITURE
password server = );
        if ($random >= 4) {
            print $smb_conf_fh 'dc01.mediture.dom ';
            print $smb_conf_fh 'dc03.mediture.dom ';
            print $smb_conf_fh 'dc02.mediture.dom ';
            print $smb_conf_fh 'dc04.mediture.dom';
        } else {
            print $smb_conf_fh 'dc03.mediture.dom ';
            print $smb_conf_fh 'dc01.mediture.dom ';
            print $smb_conf_fh 'dc04.mediture.dom ';
            print $smb_conf_fh 'dc02.mediture.dom';
        }
        print $smb_conf_fh q(
realm = MEDITURE.DOM
security = ads

template homedir = /home/%U
template shell = /bin/bash

winbind use default domain = true

#--authconfig--end-line--
server string = Samba Server Version %v

# logs split per machine
log file = /var/log/samba/log.%m
# max 50KB per log file, then rotate
max log size = 50

passdb backend = tdbsam

winbind refresh tickets = yes
winbind offline logon = yes
winbind use default domain = yes
winbind nss info = rfc2307
winbind enum users = yes
winbind enum groups = yes
winbind nested groups = yes

kerberos method = secrets and keytab

idmap config *: backend = tdb
idmap config *: range = 90000001-100000000

idmap config MEDITURE: backend = ad
idmap config MEDITURE: range = 10000-49999
idmap config MEDITURE: schema mode = rfc2307);
        close($smb_conf_fh);
    }
}

I also have AD sites set up and have manually configured SRV records to perform load balancing:

$ dig +short srv _ldap._tcp.vsc._sites.dc._msdcs.mediture.dom
0 50 389 dc02.mediture.dom.
0 25 389 dc04.mediture.dom.
0 100 389 dc01.mediture.dom.
0 100 389 dc03.mediture.dom.

$ dig +short srv _ldap._tcp.aws._sites.dc._msdcs.mediture.dom
0 25 389 dc02.mediture.dom.
0 100 389 dc04.mediture.dom.
0 50 389 dc01.mediture.dom.
0 50 389 dc03.mediture.dom.

$ dig +short srv _ldap._tcp.epo._sites.dc._msdcs.mediture.dom
0 25 389 dc04.mediture.dom.
0 100 389 DC02.mediture.dom.
0 50 389 dc01.mediture.dom.
0 50 389 dc03.mediture.dom.

$ dig +short srv _ldap._tcp.Default-First-Site-Name._sites.dc._msdcs.mediture.dom
0 100 389 dc01.mediture.dom.
0 100 389 dc03.mediture.dom.

$ dig +short srv _ldap._tcp.vsc._sites.mediture.dom
0 100 389 dc01.mediture.dom.
0 100 389 dc03.mediture.dom.
0 50 389 dc02.mediture.dom.
0 25 389 dc04.mediture.dom.

$ dig +short srv _ldap._tcp.aws._sites.mediture.dom
0 100 389 dc04.mediture.dom.
0 50 389 dc01.mediture.dom.
0 50 389 dc03.mediture.dom.
0 25 3268 dc02.mediture.dom.

$ dig +short srv _ldap._tcp.epo._sites.mediture.dom
0 25 389 dc04.mediture.dom.
0 100 389 dc02.mediture.dom.
0 50 389 dc01.mediture.dom.
0 50 389 dc03.mediture.dom.

$ dig +short srv _ldap._tcp.Default-First-Site-Name._sites.mediture.dom
0 100 389 dc04.mediture.dom.
0 100 389 dc01.mediture.dom.
0 100 389 dc02.mediture.dom.
0 100 389 dc03.mediture.dom.

I'm not seeing balanced traffic, though:

[root@dc01 ~]# netstat -an | grep 445 | grep -c ESTABLISHED
164
[root@dc03 ~]# netstat -an | grep 445 | grep -c ESTABLISHED
10

[root@dc01 ~]# netstat -an | grep 88 | grep -c ESTABLISHED
20
[root@dc03 ~]# netstat -an | grep 88 | grep -c ESTABLISHED
2

[root@dc01 ~]# netstat -an | grep 389 | grep -c ESTABLISHED
175
[root@dc03 ~]# netstat -an | grep 389 | grep -c ESTABLISHED
23

[root@dc01 ~]# netstat -an | grep 636 | grep -c ESTABLISHED
3
[root@dc03 ~]# netstat -an | grep 636 | grep -c ESTABLISHED
7

[root@dc01 ~]# netstat -an | grep 53 | grep -c ESTABLISHED
42
[root@dc03 ~]# netstat -an | grep 53 | grep -c ESTABLISHED
6

I only have a handful of Windows instances joined to the domain at that site, VSC, but over 100 Linux nodes.
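[Editor's note: one caveat with the counts above — `netstat -an | grep 445` also matches any connection whose ephemeral port merely contains "445" (e.g. 50445), and `grep 88`/`grep 53` match far more than the service ports. A sketch that anchors on the local service port instead; the function name is mine, and the column positions assume `netstat -ant` output (local address in column 4, state in column 6):]

```shell
#!/bin/sh
# Count ESTABLISHED connections terminating on a given local port.
count_established() {
    # $4 is the local "address:port" column, $6 the TCP state;
    # the regex anchors ":PORT" at the end of the local address,
    # so :50445 is not counted when asking about port 445.
    netstat -ant | awk -v p=":$1" '$4 ~ p"$" && $6 == "ESTABLISHED" { n++ } END { print n + 0 }'
}

# Usage on each DC:
# for port in 445 88 389 636 53; do
#     echo "$port: $(count_established "$port")"
# done
```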
Thanks,
Arthur

On 09/29/2016 10:16 AM, Arthur Ramsey wrote:
I got core dumps while the issue was happening. Here are the backtraces: http://pastebin.com/N0e2fsSQ. It seems to be TDB contention?

Thanks,
Arthur

On 10/7/2016 11:12 AM, Arthur Ramsey wrote: