Michael Tokarev
2022-Feb-11 21:34 UTC
[Samba] Corruption of winbind cache after converting NT4 to AD domain
Hi!

We've been using an NT4 domain with Samba for many years (more than a decade for sure), quite successfully. Instead of fighting with it every time, we finally decided to convert it to AD. With that we hit several quite bad issues, so our network hasn't been working right for over a week already. Here's one of the issues (more to follow).

I created a new machine for the DC, parallel to the fileserver, which used to be everything at once. I copied all configuration and data to it and ran classicupgrade there, which worked fine after several attempts (we had to fix some issues, that's ok).

The main fileserver: I stopped it, moved everything out, leaving just the share definitions in the conffile, and joined it to the domain (net ads join member). That also went fine, and after configuring nsswitch and other stuff, it started working.

Immediately we faced a problem with roaming profiles: at first Windows did everything, but after a few logins/logouts it refused to synchronize the profile, saying that its owner is wrong: "Unix user mjt" instead of "DOMAIN\mjt".

After long and painful debugging (since there's very little info about how it all works, which component does what and how it all should be done), it all boiled down to winbind cache corruption/pollution. Somewhat similar to this one:

https://lists.samba.org/archive/samba-technical/2019-February/132730.html

except that in our case it is different.

After net cache flush I look up every uid we have with wbinfo --uid-info. Everything's fine, every uid is looked up fine. But after some random time, wbinfo --uid-info starts to return DOMAIN_NOT_FOUND errors for one or two uids; some more time, and the number of "not found" entries grows and grows.

winbindd_getpwuid_send: [wbinfo (47371)] getpwuid 1068
Opening cache file at /run/samba/gencache.tdb
wbint_UnixIDs2Sids: struct wbint_UnixIDs2Sids
    in: struct wbint_UnixIDs2Sids
        domain_name : *
            domain_name : 'TSRV'
        domain_sid : S-1-5-21-489615817-366373558-193322279
        num_ids : 0x00000001 (1)
        xids: ARRAY(1)
            xids: struct unixid
                id : 0x0000042c (1068)
                type : ID_TYPE_UID (1)

(This is the request by wbinfo.) I don't know why domain_name is TSRV: TSRV is the file server (netbios and host name), while the domain in question is TLS.MSK.RU (TLS). But that's not the issue, it works so far. idmap_backend is ad, fwiw.

..lots of info...
idmap_ad_unixids_to_sids: Mapped S-1-5-21-411424318-379842365-2075518510-1024 -> 1068 (1)
...
gencache_set_data_blob: Adding cache entry with key=[IDMAP/SID2XID/S-1-5-21-411424318-379842365-2075518510-1024] and timeout=[...] (604800 seconds ahead)
gencache_set_data_blob: Adding cache entry with key=[IDMAP/UID2SID/1068] and timeout=[...] (604800 seconds ahead)
...
Finding user BAY
Trying _Get_Pwnam(), username as lowercase is bay
Get_Pwnam_internals did find user [BAY]!
init_lsa_rids: BAY found
gencache_set_data_blob: Adding cache entry with key=[NAME2SID/\BAY] and timeout=[...] (300 seconds ahead)
gencache_set_data_blob: Adding cache entry with key=[SID2NAME/S-1-22-1-1068] and timeout=[...] (300 seconds ahead)
gencache_set_data_blob: Adding cache entry with key=[NAME2SID/UNIX USER\BAY] and timeout=[...] (300 seconds ahead)
Finished processing child request 56
Writing 4032 bytes to parent
wbint_LookupName: struct wbint_LookupName
    out: struct wbint_LookupName
        type : *
            type : SID_NAME_USER (1)
        sid : *
            sid : S-1-22-1-1068
        result : NT_STATUS_OK
...
gencache_set_data_blob: Adding cache entry with key=[IDMAP/SID2XID/S-1-22-1-1068] and timeout=[?? ??? 18 01:00:59 2022 MSK] (604800 seconds ahead)
gencache_set_data_blob: Adding cache entry with key=[IDMAP/UID2SID/1068] and timeout=[?? ??? 18 01:00:59 2022 MSK] (604800 seconds ahead)

(here, as far as I can tell, the value it wrote for IDMAP/UID2SID/1068 is S-1-22-1-1068)

Finished processing child request 56
Writing 4040 bytes to parent
wbint_Sids2UnixIDs: struct wbint_Sids2UnixIDs
    out: struct wbint_Sids2UnixIDs
        ids : *
            ids: struct wbint_TransIDArray
                num_ids : 0x00000001 (1)
                ids: ARRAY(1)
                    ids: struct wbint_TransID
                        type_hint : ID_TYPE_NOT_SPECIFIED (0)
                        domain_index : 0x00000000 (0)
                        rid : 0x0000042c (1068)
                        xid: struct unixid
                            id : 0x0000042c (1068)
                            type : ID_TYPE_UID (1)
        result : NT_STATUS_OK
gencache_set_data_blob: Adding cache entry with key=[IDMAP/SID2XID/S-1-22-1-1068] and timeout=[?? ??? 18 01:00:59 2022 MSK] (604800 seconds ahead)
gencache_set_data_blob: Adding cache entry with key=[IDMAP/UID2SID/1068] and timeout=[?? ??? 18 01:00:59 2022 MSK] (604800 seconds ahead)
...

and next it starts to return errors:

Could not convert sid S-1-22-1-1068: NT_STATUS_NO_SUCH_USER
process_request_done: [nss_winbind(47509):GETGROUPS]: NT_STATUS_NO_SUCH_USER
winbindd_getpwuid_send: [wbinfo (47516)] getpwuid 1068
wb_xids2sids_send: Found UID in cache: S-1-22-1-1068
Could not convert sid S-1-22-1-1068: NT_STATUS_INVALID_PARAMETER
process_request_done: [wbinfo(47516):GETPWUID]: NT_STATUS_INVALID_PARAMETER
process_request_written: [wbinfo(47516):GETPWUID]: delivered response to client

etc.

These are just selected parts of the picture; the whole winbind trace file is here:
http://www.corpit.ru/mjt/tmp/winbind.trc

Obviously, from now on, uid 1068 does not work anymore. Over time, more and more uids stop working, until the next `net cache flush'.

Now, the most "interesting" part, besides the obvious wrong behaviour somewhere.

For a long time, we had unix users with their own regular home directories, shell access and lots of work in linux. As far as I can see, in order to use an AD domain, we should convert linux users to AD, so that a user is EITHER in linux OR in AD, but not both.
I found nothing conclusive about this; there's no direct requirement like this in the docs I found so far. But I see that people do it like this, not mixing uids and usernames. It is just my gut feeling, maybe I'm wrong.

So there are two parts of the question:

First, how should such a setup be done? We are really used to linux auth and linux work; it's somewhat unnatural to rely on AD when dealing with local linux accounts. But at the same time, these accounts should have access from Windows to their files. And most important, _why_ should this setup be done?

And second, what to do with this cache corruption, how to prevent it? Is it possible to perform AD auth by samba AND linux auth when logging in to the linux machine? Adding --no-cache to winbind command line helped, but this obviously is not a good solution...

System info:

samba 4.13.13+dfsg-1~deb11u2 on debian bullseye, current.

smb.conf:
[global]
  server string = %h samba server %v
  netbios name = TSRV
  netbios aliases = LINUX FS
  realm = TLS.MSK.RU
  workgroup = TLS
  server role = member server
  security = ADS

  idmap config TLS : backend = ad
  idmap config TLS : range = 1000-3000
  idmap config TLS : schema_mode = rfc2307
  idmap config TLS : unix_primary_group = yes
  template homedir = /home/%U
  idmap config * : backend = tdb
  idmap config * : range = 5000-7000

...share definitions...

Thank you for the time! It turned out to be quite a bit longer than I expected...

/mjt
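
To see which uids are already affected without waiting for lookups to fail, one could scan a winbind trace for the "Found UID in cache" message shown in the excerpt above and flag every hit whose cached SID starts with S-1-22-1- (the "Unix User" prefix) rather than a domain SID. This is only a minimal sketch: the function name polluted_uids is made up for illustration, and it assumes the trace contains that message in exactly the form quoted above.

```python
import re

# Matches the "Found UID in cache" message as it appears in the trace excerpt.
FOUND_RE = re.compile(r"wb_xids2sids_send: Found UID in cache: (S-[0-9-]+)")

# SIDs of the form S-1-22-1-<uid> denote local "Unix User" accounts.
UNIX_USER_PREFIX = "S-1-22-1-"


def polluted_uids(trace_lines):
    """Return the set of uids whose cached SID is a Unix User SID.

    trace_lines is any iterable of strings, e.g. an open trace file.
    """
    bad = set()
    for line in trace_lines:
        match = FOUND_RE.search(line)
        if match and match.group(1).startswith(UNIX_USER_PREFIX):
            # The last SID component is the uid itself.
            bad.add(int(match.group(1)[len(UNIX_USER_PREFIX):]))
    return bad
```

Passing an open trace file to polluted_uids() would give the uids that need attention before the next net cache flush.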
Patrick Goetz
2022-Feb-11 22:01 UTC
[Samba] Corruption of winbind cache after converting NT4 to AD domain
On 2/11/22 15:34, Michael Tokarev via samba wrote:
>
> For a long time, we had unix users with their own regular home directories,
> shell access and lots of work in linux. As far as I can see, in order to
> use AD domain, we should convert linux users to AD, so that a user is EITHER
> in linux OR in AD, but not both. I found nothing conclusive about this, it
> is just my gut feeling, - there's no direct requirement like this in the docs
> I found so far. But I see that people do it like this, not mixing uids and
> usernames. It is just my gut feeling maybe I'm wrong..
>
> So there are two parts of the question:
>
> First, how such setup should be done? We really used to linux auth and linux
> work, it's somewhat unnatural to rely on the AD when dealing with local linux
> accounts. But at the same time, these account should have access from windows
> to their files. And most important, _why_ this setup should be done?
>
> And second, what to do with this cache corruption, how to prevent it? Is it
> possible to perform AD auth by samba AND linux auth when logging in to the linux
> machine? Adding --no-cache to winbind command line helped, but this obviously
> is not a good solution...
>

I just moved from NT4 to Samba AD too. My original plan was to leave the linux machines standalone, but the more I worked with the system the more obvious it became that this was a bad idea for various reasons; e.g. the access permissions on filesystems shared to Windows machines aren't the same if you don't bind the linux workstation to the domain.

So, what I'm currently doing on the linux machines:

1. Remove local linux accounts which match AD accounts.
2. Bind the linux machine to the domain.
3. Reset the permissions on the /home/USER directories on the linux machines to match the UID assigned by Samba.

If you're using security groups, these work, too, and you can assign permissions on linux with these, too.

This seems to work pretty well and avoids the complications of using, say, autofs. You're just using AD for authentication in this case, although of course you can mount shares, too. I *don't*, and continue to use NFS to mount file systems between linux machines. You can also make this work if you have a home directory server with autofs clients. Just execute the above on the home directory server and make sure your linux clients are using AD to authenticate.
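
For step 1 in the list above, a small script can report which local accounts would need to be removed. The following is a rough sketch, not a tested migration tool: the function names are invented for illustration, and it assumes you already have the list of domain account names without a DOMAIN\ prefix (for example from the output of wbinfo -u, which may or may not carry the prefix depending on your configuration).

```python
def parse_passwd(text):
    """Parse passwd(5)-format text into a {name: uid} dictionary."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # name:password:uid:... ; skip anything without a numeric uid field
        if len(fields) >= 3 and fields[2].isdigit():
            users[fields[0]] = int(fields[2])
    return users


def colliding_accounts(local_passwd_text, domain_users):
    """Return the sorted local account names that also exist in the domain.

    Names are compared case-insensitively, because Windows account
    names are not case-sensitive.
    """
    domain = {name.lower() for name in domain_users}
    local = parse_passwd(local_passwd_text)
    return sorted(name for name in local if name.lower() in domain)
```

Each name it returns is a candidate for removal from /etc/passwd, after which the home directory ownership has to be reset as in step 3.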
Rowland Penny
2022-Feb-11 22:05 UTC
[Samba] Corruption of winbind cache after converting NT4 to AD domain
On Sat, 2022-02-12 at 00:34 +0300, Michael Tokarev via samba wrote:
> Hi!
>
> We've been using NT4 domain with samba for many years (more than a
> decade for sure), quite successfully. And instead of fighting with it
> every time, we finally decided to convert it to AD. And with that, we
> faced numerous quite bad issues, so that our network isn't working
> right for over a week already. Here's one of the issues (more to
> follow).
>
> I created a new machine for the DC, parallel to the fileserver which
> was everything at once. Copied all configuration and data to it, and
> did classicupgrade there. Which worked fine after several attempts
> (we had to fix some issues, that's ok).
>
> The main fileserver - I stopped it, moved everything out, leaving
> just the share definitions in conffile, and joined it to the domain
> (net ads join member). Which also went fine. and after configuring
> nsswitch and other stuff, it started working.
>
> And immediately we faced a problem with roaming profiles - at first
> windows did everything but after a few logins/logouts it refused to
> synchronize profile telling that its owner is wrong - "Unix user mjt"
> instead of "DOMAIN\mjt".
>
> After long and painful debugging (since there's very little info
> about how it all works, which components does what and how it all
> should be done) it all boiled down to winbind cache
> corruption/pollution. Somewhat similar to this one:
>
> https://lists.samba.org/archive/samba-technical/2019-February/132730.html
>
> except that in our case it is different.
>
> After net cache flush I lookup every uid we have with wbinfo
> --uid-info. Everything's fine, every uid is looked up fine. But after
> some random time, wbinfo --uid-info start to return DOMAIN_NOT_FOUND
> errors to one or two, some more time and the amount of "not found"
> entries grows and grows.
>
> There are just selected parts of the picture, whole winbind trace
> file is here:
> http://www.corpit.ru/mjt/tmp/winbind.trc
>
> Obviously, from now on, uid 1068 does not work anymore. Over time,
> more and more uids stop working, until next `net cache flush'.
>
> Now, the most "interesting" part, besides the obvious wrong behaviour
> somewhere.
>
> For a long time, we had unix users with their own regular home
> directories, shell access and lots of work in linux. As far as I can
> see, in order to use AD domain, we should convert linux users to AD,
> so that a user is EITHER in linux OR in AD, but not both. I found
> nothing conclusive about this,

The old way was to have a Unix user and a Samba user; this mapped Windows users to Unix users. Now, with AD, you only have one user and that user is stored in AD. Winbind maps the AD user to a Unix ID and hence makes the user a Unix user. This all means that if you have a user called 'fred' in both AD and /etc/passwd, you should remove the local Unix user from /etc/passwd.

> it is just my gut feeling, - there's no direct requirement like this
> in the docs

This was explained in the Samba wiki, but someone has just removed it.

> I found so far. But I see that people do it like this, not mixing
> uids and usernames. It is just my gut feeling maybe I'm wrong..

It is not so much that you are mixing uids and usernames; you seem to be possibly mixing users.

>
> So there are two parts of the question:
>
> First, how such setup should be done? We really used to linux auth
> and linux work, it's somewhat unnatural to rely on the AD when
> dealing with local linux accounts. But at the same time, these
> account should have access from windows to their files. And most
> important, _why_ this setup should be done?

You should only have users in AD, and 'getent passwd username' should produce output, something like this:

rowland@devstation:~$ getent passwd rowland
rowland:*:10000:10000:Rowland Penny:/home/rowland:/bin/bash

I can assure you that 'rowland' isn't in /etc/passwd.

>
> And second, what to do with this cache corruption, how to prevent it?

Set up your system correctly.

> Is it possible to perform AD auth by samba AND linux auth when
> logging in to the linux machine? Adding --no-cache to winbind command
> line helped, but this obviously is not a good solution...

No, it is a BAD solution.

>
> System info:
>
> samba 4.13.13+dfsg-1~deb11u2 on debian bullseye, current.
>
> smb.conf:
> [global]
>   server string = %h samba server %v
>   netbios name = TSRV
>   netbios aliases = LINUX FS

I do not recommend using 'netbios aliases'; use a DNS 'CNAME' instead.

>   realm = TLS.MSK.RU
>   workgroup = TLS
>   server role = member server
>   security = ADS
>
>   idmap config TLS : backend = ad
>   idmap config TLS : range = 1000-3000
>   idmap config TLS : schema_mode = rfc2307
>   idmap config TLS : unix_primary_group = yes
>   template homedir = /home/%U
>   idmap config * : backend = tdb
>   idmap config * : range = 5000-7000
>
> ...share definitions...
>
> Thank you for the time! It turned out to be quite a bit longer than I
> expected...

No problem, I await your further questions :-)

Rowland
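
The getent check described in this reply can also be scripted. Below is a minimal sketch, with invented helper names, that reports whether an account is defined in the local passwd file, is resolved only through NSS (which is the state the reply recommends for AD users), or does not resolve at all. It relies on Python's pwd.getpwnam, which goes through the same NSS lookup path as getent; the default path /etc/passwd is an assumption about the system layout.

```python
import pwd


def local_passwd_names(path="/etc/passwd"):
    """Return the set of account names defined in the local passwd file."""
    names = set()
    with open(path) as passwd_file:
        for line in passwd_file:
            line = line.strip()
            if line and not line.startswith("#"):
                names.add(line.split(":", 1)[0])
    return names


def account_source(name, local_names, lookup=pwd.getpwnam):
    """Classify where an account comes from.

    Returns "local" if the name is in the local passwd file (possibly
    shadowing an AD account), "nss-only" if it resolves through NSS but
    is not local, and "missing" if it does not resolve at all.
    """
    if name in local_names:
        return "local"
    try:
        lookup(name)
    except KeyError:
        # pwd.getpwnam raises KeyError when the entry cannot be found.
        return "missing"
    return "nss-only"
```

On a correctly joined member server, account_source("rowland", local_passwd_names()) would be expected to give "nss-only" for a user like the one in the getent output above; "local" for a name that also exists in AD points at the mixed setup discussed in this thread.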