Michael Tokarev
2022-Feb-11 21:34 UTC
[Samba] Corruption of winbind cache after converting NT4 to AD domain
Hi!
We've been using an NT4 domain with samba for many years (more than a decade
for sure), quite successfully. Instead of fighting with it every time, we
finally decided to convert it to AD. With that, we hit numerous quite bad
issues, so our network hasn't been working right for over a week already.
Here's one of the issues (more to follow).
I created a new machine for the DC, parallel to the fileserver, which used to
be everything at once. I copied all configuration and data to it and ran
classicupgrade there, which worked fine after several attempts (we had to fix
some issues, that's ok).
The main fileserver: I stopped it, moved everything out, leaving just the share
definitions in the conffile, and joined it to the domain (net ads join member).
That also went fine, and after configuring nsswitch and other stuff, it started
working.
And immediately we faced a problem with roaming profiles: at first windows did
everything, but after a few logins/logouts it refused to synchronize the
profile, saying that its owner is wrong - "Unix user mjt" instead of
"DOMAIN\mjt".
After long and painful debugging (since there's very little info about how it
all works, which component does what and how it all should be done), it all
boiled down to winbind cache corruption/pollution. Somewhat similar to this one:
https://lists.samba.org/archive/samba-technical/2019-February/132730.html
except that in our case it is different.
After net cache flush I look up every uid we have with wbinfo --uid-info.
Everything's fine, every uid is looked up fine. But after some random
time, wbinfo --uid-info starts to return DOMAIN_NOT_FOUND errors for one or
two of them; after some more time the number of "not found" entries grows
and grows.
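The check described above (look up every uid, see which ones stop resolving) could be automated roughly like this. It is only a sketch: the function names are made up for illustration, and it assumes that wbinfo exits with a non-zero status when a lookup fails.

```python
import subprocess


def wbinfo_lookup(uid):
    """Run `wbinfo --uid-info=<uid>` and return (exit_code, combined_output)."""
    proc = subprocess.run(
        ["wbinfo", f"--uid-info={uid}"],
        capture_output=True,
        text=True,
    )
    return proc.returncode, (proc.stdout + proc.stderr).strip()


def failing_uids(uids, lookup=wbinfo_lookup):
    """Return {uid: message} for every uid whose lookup exits non-zero.

    `lookup` is injectable so the logic can be exercised without winbind.
    """
    failures = {}
    for uid in uids:
        code, output = lookup(uid)
        if code != 0:
            failures[uid] = output
    return failures
```

Running something like `failing_uids([1068, 1069])` from cron right after a `net cache flush`, and again every few minutes, should show when each uid stops resolving.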
winbindd_getpwuid_send: [wbinfo (47371)] getpwuid 1068
Opening cache file at /run/samba/gencache.tdb
wbint_UnixIDs2Sids: struct wbint_UnixIDs2Sids
in: struct wbint_UnixIDs2Sids
domain_name : *
domain_name : 'TSRV'
domain_sid : S-1-5-21-489615817-366373558-193322279
num_ids : 0x00000001 (1)
xids: ARRAY(1)
xids: struct unixid
id : 0x0000042c (1068)
type : ID_TYPE_UID (1)
(this is the request from wbinfo). I don't know why domain_name is TSRV - TSRV
is the file server (netbios and host name). The domain in question is TLS.MSK.RU
(TLS). But that's not the issue, it works so far. The idmap backend is ad,
fwiw.
..lots of info...
idmap_ad_unixids_to_sids: Mapped S-1-5-21-411424318-379842365-2075518510-1024
-> 1068 (1)
...
gencache_set_data_blob: Adding cache entry with
key=[IDMAP/SID2XID/S-1-5-21-411424318-379842365-2075518510-1024] and
timeout=[...] (604800 seconds ahead)
gencache_set_data_blob: Adding cache entry with key=[IDMAP/UID2SID/1068] and
timeout=[...] (604800 seconds ahead)
...
Finding user BAY
Trying _Get_Pwnam(), username as lowercase is bay
Get_Pwnam_internals did find user [BAY]!
init_lsa_rids: BAY found
gencache_set_data_blob: Adding cache entry with key=[NAME2SID/\BAY] and
timeout=[...] (300 seconds ahead)
gencache_set_data_blob: Adding cache entry with key=[SID2NAME/S-1-22-1-1068] and
timeout=[...] (300 seconds ahead)
gencache_set_data_blob: Adding cache entry with key=[NAME2SID/UNIX USER\BAY] and
timeout=[...] (300 seconds ahead)
Finished processing child request 56
Writing 4032 bytes to parent
wbint_LookupName: struct wbint_LookupName
out: struct wbint_LookupName
type : *
type : SID_NAME_USER (1)
sid : *
sid : S-1-22-1-1068
result : NT_STATUS_OK
...
gencache_set_data_blob: Adding cache entry with
key=[IDMAP/SID2XID/S-1-22-1-1068] and timeout=[?? ??? 18 01:00:59 2022 MSK]
(604800 seconds ahead)
gencache_set_data_blob: Adding cache entry with key=[IDMAP/UID2SID/1068] and
timeout=[?? ??? 18 01:00:59 2022 MSK] (604800 seconds ahead)
(here, as far as I can tell, the value it wrote for IDMAP/UID2SID/1068 is
S-1-22-1-1068)
Finished processing child request 56
Writing 4040 bytes to parent
wbint_Sids2UnixIDs: struct wbint_Sids2UnixIDs
out: struct wbint_Sids2UnixIDs
ids : *
ids: struct wbint_TransIDArray
num_ids : 0x00000001 (1)
ids: ARRAY(1)
ids: struct wbint_TransID
type_hint : ID_TYPE_NOT_SPECIFIED (0)
domain_index : 0x00000000 (0)
rid : 0x0000042c (1068)
xid: struct unixid
id : 0x0000042c (1068)
type : ID_TYPE_UID (1)
result : NT_STATUS_OK
gencache_set_data_blob: Adding cache entry with
key=[IDMAP/SID2XID/S-1-22-1-1068] and timeout=[?? ??? 18 01:00:59 2022 MSK]
(604800 seconds ahead)
gencache_set_data_blob: Adding cache entry with key=[IDMAP/UID2SID/1068] and
timeout=[?? ??? 18 01:00:59 2022 MSK] (604800 seconds ahead)
...
and next it starts to return errors:
Could not convert sid S-1-22-1-1068: NT_STATUS_NO_SUCH_USER
process_request_done: [nss_winbind(47509):GETGROUPS]: NT_STATUS_NO_SUCH_USER
winbindd_getpwuid_send: [wbinfo (47516)] getpwuid 1068
wb_xids2sids_send: Found UID in cache: S-1-22-1-1068
Could not convert sid S-1-22-1-1068: NT_STATUS_INVALID_PARAMETER
process_request_done: [wbinfo(47516):GETPWUID]: NT_STATUS_INVALID_PARAMETER
process_request_written: [wbinfo(47516):GETPWUID]: delivered response to client
etc.
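The pattern in the excerpt above (idmap_ad maps a uid to a domain SID, and later a S-1-22-1-<uid> "Unix user" SID gets cached for the same uid) can be searched for in a full trace. This is a rough sketch: the two regular expressions are guesses based only on the log lines quoted here, and polluted_uids is a made-up name.

```python
import re

# "idmap_ad_unixids_to_sids: Mapped <domain SID> -> <uid> (1)"; the line may
# be wrapped before "->", and "(1)" is ID_TYPE_UID in the excerpt above.
MAPPED = re.compile(r"Mapped (S-\d+(?:-\d+)+)\s*->\s*(\d+)\s*\(1\)")

# "Adding cache entry with key=[IDMAP/SID2XID/S-1-22-1-<uid>]"
UNIX_SID_CACHED = re.compile(r"key=\[IDMAP/SID2XID/S-1-22-1-(\d+)\]")


def polluted_uids(trace_text):
    """Return {uid: domain_sid} for uids that idmap_ad mapped to a domain SID
    but that also had a S-1-22-1-<uid> entry written to the idmap cache."""
    domain_sid = {int(uid): sid for sid, uid in MAPPED.findall(trace_text)}
    unix_cached = {int(uid) for uid in UNIX_SID_CACHED.findall(trace_text)}
    return {uid: domain_sid[uid] for uid in sorted(unix_cached) if uid in domain_sid}
```

With a local copy of the trace this would be called as `polluted_uids(open("winbind.trc").read())`; every uid it returns is a candidate for the failure shown above.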
These are just selected parts of the picture; the whole winbind trace file is here:
http://www.corpit.ru/mjt/tmp/winbind.trc
Obviously, from now on, uid 1068 does not work anymore. Over time, more and
more uids stop working, until the next `net cache flush'.
Now, the most "interesting" part, besides the obviously wrong behaviour
somewhere.
For a long time, we've had unix users with their own regular home directories,
shell access and lots of work in linux. As far as I can see, in order to
use an AD domain, we should convert linux users to AD, so that a user is EITHER
in linux OR in AD, but not both. I found nothing conclusive about this; there's
no direct requirement like this in the docs I've found so far. But I see that
people do it like this, not mixing uids and usernames. It is just my gut
feeling, maybe I'm wrong..
So there are two parts of the question:
First, how should such a setup be done? We're really used to linux auth and
linux work, and it's somewhat unnatural to rely on AD when dealing with local
linux accounts. But at the same time, these accounts should have access from
windows to their files. And most important, _why_ should this setup be done?
And second, what to do with this cache corruption, how to prevent it? Is it
possible to perform AD auth by samba AND linux auth when logging in to the linux
machine? Adding --no-cache to winbind command line helped, but this obviously
is not a good solution...
System info:
samba 4.13.13+dfsg-1~deb11u2 on debian bullseye, current.
smb.conf:
[global]
server string = %h samba server %v
netbios name = TSRV
netbios aliases = LINUX FS
realm = TLS.MSK.RU
workgroup = TLS
server role = member server
security = ADS
idmap config TLS : backend = ad
idmap config TLS : range = 1000-3000
idmap config TLS : schema_mode = rfc2307
idmap config TLS : unix_primary_group = yes
template homedir = /home/%U
idmap config * : backend = tdb
idmap config * : range = 5000-7000
...share definitions...
Thank you for the time! It turned out to be quite a bit longer than I
expected...
/mjt
Patrick Goetz
2022-Feb-11 22:01 UTC
[Samba] Corruption of winbind cache after converting NT4 to AD domain
On 2/11/22 15:34, Michael Tokarev via samba wrote:
> For a long time, we had unix users with their own regular home directories,
> shell access and lots of work in linux. As far as I can see, in order to
> use AD domain, we should convert linux users to AD, so that a user is EITHER
> in linux OR in AD, but not both. I found nothing conclusive about this, it
> is just my gut feeling, - there's no direct requirement like this in the docs
> I found so far. But I see that people do it like this, not mixing uids and
> usernames. It is just my gut feeling maybe I'm wrong..
>
> So there are two parts of the question:
>
> First, how such setup should be done? We really used to linux auth and linux
> work, it's somewhat unnatural to rely on the AD when dealing with local linux
> accounts. But at the same time, these account should have access from windows
> to their files. And most important, _why_ this setup should be done?
>
> And second, what to do with this cache corruption, how to prevent it? Is it
> possible to perform AD auth by samba AND linux auth when logging in to the
> linux machine? Adding --no-cache to winbind command line helped, but this
> obviously is not a good solution...

I just moved from NT4 to Samba AD too. My original plan was to leave the linux machines standalone, but the more I worked with the system the more obvious it became that this was a bad idea for various reasons; e.g. the access permissions on filesystems shared to Windows machines aren't the same if you don't bind the linux workstation to the domain. So, what I'm currently doing on the linux machines:
1. Remove local linux accounts which match AD accounts.
2. Bind the linux machine to the domain.
3. Reset the permissions on the /home/USER directories on the linux machines to match the UID assigned by Samba. If you're using security groups, these work, too, and you can assign permissions on linux with these, too.
This seems to work pretty well and avoids the complications of using, say, autofs. You're just using AD for authentication in this case, although of course you can mount shares, too. I *don't*, and continue to use NFS to mount file systems between linux machines. You can also make this work if you have a home directory server with autofs clients. Just execute the above on the home directory server and make sure your linux clients are using AD to authenticate.
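For step 1 above, the local accounts whose names also exist in AD could be listed with something like the following. It is a minimal sketch with made-up function names; it assumes the AD names come from `wbinfo -u` (one per line, possibly prefixed with DOMAIN and the winbind separator) and reads /etc/passwd text directly, since anything going through NSS would also return the winbind users.

```python
def local_accounts(passwd_text):
    """Map user name -> uid for every entry in /etc/passwd-style text."""
    accounts = {}
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) >= 3 and fields[2].isdigit():
            accounts[fields[0]] = int(fields[2])
    return accounts


def shadowing_accounts(passwd_text, domain_users, separator="\\"):
    """Return {local_name: uid} for local accounts whose name also exists in AD.

    domain_users: names as printed by `wbinfo -u`, with or without a
    DOMAIN<separator> prefix. The comparison ignores case.
    """
    ad_names = {name.rsplit(separator, 1)[-1].lower() for name in domain_users}
    return {
        name: uid
        for name, uid in local_accounts(passwd_text).items()
        if name.lower() in ad_names
    }
```

Feeding it the contents of /etc/passwd and the lines of `wbinfo -u` output gives the candidates for removal, together with the local uid that their files are currently owned by (useful for step 3).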
Rowland Penny
2022-Feb-11 22:05 UTC
[Samba] Corruption of winbind cache after converting NT4 to AD domain
On Sat, 2022-02-12 at 00:34 +0300, Michael Tokarev via samba wrote:
> Hi!
>
> We've been using NT4 domain with samba for many years (more than a
> decade for sure), quite successfully. And instead of fighting with it
> every time, we finally decided to convert it to AD. And with that, we
> faced numerous quite bad issues, so that our network isn't working right
> for over a week already. Here's one of the issues (more to follow).
>
> I created a new machine for the DC, parallel to the fileserver which
> was everything at once. Copied all configuration and data to it, and did
> classicupgrade there. Which worked fine after several attempts (we had
> to fix some issues, that's ok).
>
> The main fileserver - I stopped it, moved everything out, leaving just
> the share definitions in conffile, and joined it to the domain (net ads
> join member). Which also went fine. and after configuring nsswitch and
> other stuff, it started working.
>
> And immediately we faced a problem with roaming profiles - at first
> windows did everything but after a few logins/logouts it refused to
> synchronize profile telling that its owner is wrong - "Unix user mjt"
> instead of "DOMAIN\mjt".
>
> After long and painful debugging (since there's very little info about
> how it all works, which components does what and how it all should be
> done) it all boiled down to winbind cache corruption/pollution. Somewhat
> similar to this one:
>
> https://lists.samba.org/archive/samba-technical/2019-February/132730.html
>
> except that in our case it is different.
>
> After net cache flush I lookup every uid we have with wbinfo --uid-info.
> Everything's fine, every uid is looked up fine. But after some random
> time, wbinfo --uid-info start to return DOMAIN_NOT_FOUND errors to one or
> two, some more time and the amount of "not found" entries grows and
> grows.
>
> There are just selected parts of the picture, whole winbind trace file
> is here:
> http://www.corpit.ru/mjt/tmp/winbind.trc
>
> Obviously, from now on, uid 1068 does not work anymore. Over time, more
> and more uids stops working, until next `net cache flush'.
>
> Now, the most "interesting" part, besides the obvious wrong behaviour
> somewhere.
>
> For a long time, we had unix users with their own regular home
> directories, shell access and lots of work in linux. As far as I can
> see, in order to use AD domain, we should convert linux users to AD, so
> that a user is EITHER in linux OR in AD, but not both. I found nothing
> conclusive about this,

The old way was to have a Unix user and a Samba user; this mapped Windows users to Unix users. Now, with AD, you only have one user and that user is stored in AD. Winbind maps the AD user to a Unix ID and hence makes the user a Unix user. This all means that if you have a user called 'fred' in AD and in /etc/passwd, you should remove the local Unix user from /etc/passwd.

> it is just my gut feeling, - there's no direct requirement like this in
> the docs

This was explained in the Samba wiki, but someone has just removed it.

> I found so far. But I see that people do it like this, not mixing
> uids and usernames. It is just my gut feeling maybe I'm wrong..

It is not so much that you are mixing uids and usernames; you seem to be possibly mixing users.

> So there are two parts of the question:
>
> First, how such setup should be done? We really used to linux auth
> and linux work, it's somewhat unnatural to rely on the AD when dealing
> with local linux accounts. But at the same time, these account should
> have access from windows to their files.
> And most important, _why_ this setup should be done?

You should only have users in AD, and 'getent passwd username' should produce output, something like this:

rowland@devstation:~$ getent passwd rowland
rowland:*:10000:10000:Rowland Penny:/home/rowland:/bin/bash

I can assure you that 'rowland' isn't in /etc/passwd.

> And second, what to do with this cache corruption, how to prevent it?

Set up your system correctly.

> Is it possible to perform AD auth by samba AND linux auth when logging
> in to the linux machine? Adding --no-cache to winbind command line
> helped, but this obviously is not a good solution...

No, it is a BAD solution.

> System info:
>
> samba 4.13.13+dfsg-1~deb11u2 on debian bullseye, current.
>
> smb.conf:
> [global]
> server string = %h samba server %v
> netbios name = TSRV
> netbios aliases = LINUX FS

I do not recommend using 'netbios aliases'; use a DNS 'CNAME' instead.

> realm = TLS.MSK.RU
> workgroup = TLS
> server role = member server
> security = ADS
>
> idmap config TLS : backend = ad
> idmap config TLS : range = 1000-3000
> idmap config TLS : schema_mode = rfc2307
> idmap config TLS : unix_primary_group = yes
> template homedir = /home/%U
> idmap config * : backend = tdb
> idmap config * : range = 5000-7000
>
> ...share definitions...
>
> Thank you for the time! It turned out to be quite a bit longer than I
> expected...

No problem, I await your further questions :-)

Rowland