On 10/12/2020 4:06 AM, Rowland penny via samba wrote:
> On 12/10/2020 02:54, Jason Keltz via samba wrote:
>> I've been working on a Samba AD setup with a bunch of test machines - the one DC, and a bunch of clients. Last night, I ended up switching the name of the test machines temporarily (except the DC), and re-joining the domain (that's for another e-mail later). When things didn't work the way I had planned, I switched the hostnames back, and re-joined the domain today on all the test machines. I was shocked to find that I am only able to log in to the domain on one of my hosts. It fails on all the other ones. I ensured that I deleted the machine entries from AD. I haven't changed my Samba config in months, which Rowland had last verified was fine. I haven't changed my /etc/krb5.conf Kerberos config in months. I even did a complete rebuild of one of the machines since I automated the installation process, and that rebuild was working perfectly many many times, but now it fails. In the winbind log every time I try to log in I'm mostly seeing:
>
> Did you leave the domain before you changed the hostname?
>
> Why did you change the hostnames? In a case like this, I would have set up a new computer, joined this to the domain and then removed the old computer from the domain.

Hi Rowland,

I did not leave the domain, but I did delete the entry using either the Windows AD tool or the "samba-tool computer delete" option. I can't remember which one at this point. I think that clears up all the bits. Is that correct? On the local host, I also deleted /etc/krb5.keytab, and deleted all the samba bits so that the join was fresh.

Things are better today. I discovered one issue that, although seemingly unrelated (to me) to the errors, seems to have been the cause of a lot of the trouble.
I was chasing errors in the winbind log, but several of the test servers are NFS servers, and when I rejoined them to the domain, I didn't replace the nfs/X entries in their keytab. As a result, the clients couldn't mount, and that definitely caused some trouble, for which I didn't see the signs. I'm still watching though. However, I can log in to all the hosts now.

By the way, at one point, I rebooted the DC, and I noticed that all the AD clients showed something like this:

[2020/10/12 09:25:19.183616,  1, pid=36145, effective(0, 0), real(0, 0)] ../../source3/rpc_client/cli_pipe.c:422(cli_pipe_validate_current_pdu)
  ../../source3/rpc_client/cli_pipe.c:422: Bind NACK received from host dc1.ad.eecs.yorku.ca!
[2020/10/12 09:44:11.598150,  1, pid=36145, effective(0, 0), real(0, 0)] ../../source3/libads/ldap_utils.c:93(ads_do_search_retry_internal)
  Reducing LDAP page size from 1000 to 500 due to IO_TIMEOUT

(Which is strange, because this means that if you reboot the DC, then the clients start talking slower to it when it comes back up? I don't think the number ever increases unless you restart winbind everywhere?)

and since that reboot, I've seen a few of them do this:

[2020/10/12 10:00:19.814381,  1, pid=36145, effective(0, 0), real(0, 0)] ../../source3/libads/ldap_utils.c:93(ads_do_search_retry_internal)
  Reducing LDAP page size from 500 to 250 due to IO_TIMEOUT
[2020/10/12 10:16:19.557261,  1, pid=36145, effective(0, 0), real(0, 0)] ../../source3/libads/ldap_utils.c:93(ads_do_search_retry_internal)
  Reducing LDAP page size from 250 to 125 due to IO_TIMEOUT

Two of them are VirtualBox VMs, so I figured maybe it's some kind of VirtualBox thing, but one of them is an actual machine and still has the same error. The DC is very lightly loaded. How would I debug what is causing this reduction in IO?

I know that various errors in the Samba logs are not "issues", but this one seems to be an issue. I don't like seeing IO_TIMEOUTs.

Another distracting error in the log included:

[2020/10/11 22:43:29.843630,  1, pid=969, effective(0, 0), real(0, 0)] ../../source3/libads/ldap.c:565(ads_find_dc)
  ads_find_dc: name resolution for realm 'AD.EECS.YORKU.CA' (domain 'EECSYORKUCA') failed: NT_STATUS_NO_LOGON_SERVERS

... after boot, which sounds serious, but it turns out that if I try to authenticate before everything is up and running, that's what I get. The error makes sense, but there's no "follow up" to say: "Ok ok - I found it now - Sorry to give you a heart attack." It's all a learning experience.

The real reason I was trying to change the hostnames was to deal with a scenario particular to our environment. We have many dual-boot machines running Windows and Linux. I know that I can't join the domain with the same name on both Linux and Windows systems because joining one would change the password, then the other wouldn't be joined, etc. I understand that it's possible to generate a machine password manually, and use that from both sides, but as I understand it, this interferes with the system's ability to change the machine password regularly, which seems more secure. I don't know if Samba does that. I also don't want to have a different IP address for each side because that would be wasteful. I would prefer if the hostname would be the same on both sides as well.

I was trying to explore how closely the name in the AD computer database is tied to the "real" DNS name of the host. What I was trying to do was to add to /etc/samba/smb.conf:

netbios name=<system hostname>-linux

so that when I joined the hosts under Linux, they would take on a "-linux" name, but only in the AD computer database. When the host was booted, the host would have an AD name of <system hostname>-linux, but a real name of just "<system hostname>". On Windows, both the AD name and hostname would be "<system hostname>".
This would mean that on Windows, you could have a computer called "test", and under Linux, "test-linux", but both would really be the same physical PC and both would be host "test" with one IP.

It wasn't working. I am pretty sure I forgot the nfs/X entries on the NFS servers after rejoining the domain, so that may be the issue. However, thinking back, I also think that "net ads keytab" would not let me add an entry for "host/test...." because it wanted "host/test-linux....", but I could be wrong. If the host *had* to take on its real identity "test-linux", then test-linux could just be an alias for test, I guess, but then the machine build would be a headache.... and when the Linux machines boot they use DHCP (just like Windows), so the machine wouldn't know if it's "test" or "test-linux".

Lots of "fun".

Jason.
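
To keep an eye on how often the "Reducing LDAP page size" messages quoted earlier in this message show up, the winbind log can be scanned for them. This is only a minimal sketch: it assumes the log layout seen in the excerpts above (a bracketed header line followed by an indented message line), and the function name `page_size_reductions` and the `SAMPLE` text are made up for illustration and are not part of Samba.

```python
import re

# Header line of a Samba log entry, e.g. "[2020/10/12 09:44:11.598150,  1, pid=...".
HEADER_RE = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+),")
# Message text as it appears in the excerpts above.
REDUCE_RE = re.compile(r"Reducing LDAP page size from (\d+) to (\d+) due to (\w+)")


def page_size_reductions(log_text):
    """Return (timestamp, old_size, new_size, reason) for each reduction found."""
    events = []
    last_timestamp = None
    for line in log_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            # Remember the timestamp; the message normally follows on the next line.
            last_timestamp = header.group(1)
        hit = REDUCE_RE.search(line)
        if hit:
            events.append(
                (last_timestamp, int(hit.group(1)), int(hit.group(2)), hit.group(3))
            )
    return events


# Sample built from the log lines quoted in this message.
SAMPLE = """\
[2020/10/12 09:44:11.598150,  1, pid=36145, effective(0, 0), real(0, 0)] ../../source3/libads/ldap_utils.c:93(ads_do_search_retry_internal)
  Reducing LDAP page size from 1000 to 500 due to IO_TIMEOUT
[2020/10/12 10:00:19.814381,  1, pid=36145, effective(0, 0), real(0, 0)] ../../source3/libads/ldap_utils.c:93(ads_do_search_retry_internal)
  Reducing LDAP page size from 500 to 250 due to IO_TIMEOUT
[2020/10/12 10:16:19.557261,  1, pid=36145, effective(0, 0), real(0, 0)] ../../source3/libads/ldap_utils.c:93(ads_do_search_retry_internal)
  Reducing LDAP page size from 250 to 125 due to IO_TIMEOUT
"""

if __name__ == "__main__":
    for timestamp, old, new, reason in page_size_reductions(SAMPLE):
        print(f"{timestamp}: {old} -> {new} ({reason})")
```

Run against a real log file (read its contents and pass them in), this would at least show whether the reductions cluster around the DC reboot or keep happening afterwards.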
On 10/12/2020 10:36 AM, Jason Keltz wrote:
> <snipped>
> Jason

I wanted to add one more thing... It seems that I'm actually still getting this everywhere when a user logs in:

[2020/10/12 10:59:29.958617,  1, pid=23338, effective(1004, 0), real(1004, 0)] ../../source3/librpc/crypto/gse_krb5.c:417(fill_mem_keytab_from_system_keytab)
  ../../source3/librpc/crypto/gse_krb5.c:417: krb5_kt_start_seq_get failed (Permission denied)

... but at least the user can still log in.

I wonder if this is a regular error and everyone is seeing this in their logs? Just for fun, I tried to change the permission of /etc/krb5.keytab temporarily to 644, and sure enough, the error goes away.... so somehow when the user is logging in, it seems that winbind is trying to read the keytab as the user. It's not clear why that would be, but while a Google search hasn't revealed the reason for this error, I do see it in a whole lot of log files. It's just that when I'm trying to ensure there are no problems with my setup, and trying to understand the errors that do show up, it can cause panic. Whether it's a problem or not, I do not know.

Jason.
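
For anyone who wants to repeat the permission experiment described above, the mode of the keytab can be inspected without changing it. This is a minimal sketch using only the Python standard library; `describe_access` is a hypothetical helper name, and the demo runs on a temporary file rather than the real /etc/krb5.keytab. The keytab holds the machine's keys, so the 644 setting above was only a temporary test, not something this sketch suggests keeping.

```python
import os
import stat
import tempfile


def describe_access(path):
    """Return (octal mode string, readable by 'other' users?) for path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return format(mode, "04o"), bool(mode & stat.S_IROTH)


if __name__ == "__main__":
    # Demonstrate on a scratch file; pass "/etc/krb5.keytab" to check a real host.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        scratch = handle.name
    try:
        for mode in (0o600, 0o644):
            os.chmod(scratch, mode)
            print(describe_access(scratch))
    finally:
        os.remove(scratch)
```

With the file at 600, a lookup running with effective UID 1004 (as in the log header above) cannot open it, which is consistent with the "Permission denied" message; at 644 it can.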
On 12/10/2020 16:11, Jason Keltz wrote:
>> Hi Rowland,
>>
>> I did not leave the domain, but I did delete the entry by either the Windows AD tool or "samba-tool computer delete" option. I can't remember which one at this point. I think that clears up all the bits. Is that correct? On the local host, I also deleted the /etc/krb5.keytab, and deleted all the samba bits so that the join was fresh.

I would always 'leave' the domain first, before doing anything else.

>> By the way, at one point, I rebooted the DC, and I noticed that all the AD clients showed something like this:
>>
>> [2020/10/12 09:25:19.183616,  1, pid=36145, effective(0, 0), real(0, 0)] ../../source3/rpc_client/cli_pipe.c:422(cli_pipe_validate_current_pdu)
>>   ../../source3/rpc_client/cli_pipe.c:422: Bind NACK received from host dc1.ad.eecs.yorku.ca!
>> [2020/10/12 09:44:11.598150,  1, pid=36145, effective(0, 0), real(0, 0)] ../../source3/libads/ldap_utils.c:93(ads_do_search_retry_internal)
>>   Reducing LDAP page size from 1000 to 500 due to IO_TIMEOUT
>>
>> (Which is strange because this means that if you reboot the DC, then the clients start talking slower to it when it comes back up? I don't think the number ever increases unless you restart winbind everywhere?)

'page size' refers to the number of records returned; I would be more worried about the 'IO_TIMEOUT'.

>> and since that reboot, I've seen a few of them do this:
>>
>> [2020/10/12 10:00:19.814381,  1, pid=36145, effective(0, 0), real(0, 0)] ../../source3/libads/ldap_utils.c:93(ads_do_search_retry_internal)
>>   Reducing LDAP page size from 500 to 250 due to IO_TIMEOUT
>> [2020/10/12 10:16:19.557261,  1, pid=36145, effective(0, 0), real(0, 0)] ../../source3/libads/ldap_utils.c:93(ads_do_search_retry_internal)
>>   Reducing LDAP page size from 250 to 125 due to IO_TIMEOUT
>>
>> Two of them are virtualbox VMs, so I figured maybe it's some kind of virtualbox thing, but one of them is an actual machine and still has the same error. The DC is very lightly loaded. How would I debug what is causing this reduction in IO?

I would check your network connections etc.

>> I know that various errors in the Samba logs are not "issues" but this one seems to be an issue. I don't like seeing IO_TIMEOUTs.
>>
>> Another distracting error in the log included:
>>
>> [2020/10/11 22:43:29.843630,  1, pid=969, effective(0, 0), real(0, 0)] ../../source3/libads/ldap.c:565(ads_find_dc)
>>   ads_find_dc: name resolution for realm 'AD.EECS.YORKU.CA' (domain 'EECSYORKUCA') failed: NT_STATUS_NO_LOGON_SERVERS

That makes me think of DNS/network problems.

>> ... after boot which sounds serious but it turns out if I try to authenticate before everything is up and running, that's what I get. The error makes sense but there's no "follow up" to say: "Ok ok - I found it now - Sorry to give you a heart attack.". It's all a learning experience.
>>
>> <snipped>
>> Jason
>
> I wonder if this a regular error and everyone is seeing this in their logs? Just for fun, I tried to change the permission of /etc/krb5.keytab temporarily to 644, and sure enough, the error goes away.... so somehow when the user is logging in, it seems that winbind is trying to read the keytab as user. It's not clear why that would be, but while a google search hasn't revealed the reason for this error, I do see it in a whole lot of log files. It's just that when I'm trying to ensure there are no problems with my setup, and trying to understand the errors that do show up, it can cause panic. Whether it's a problem or not, I do not know

The keytab shouldn't be a problem. What are the permissions on /etc/krb5.conf?

Rowland

The permissio