thr3ads.net - samba - [Samba] Bad SMB2 (sign_algo_id=1) signature for message? [Oct 2023]

Jeff Saxe
2023-Oct-05 21:58 UTC
[Samba] Bad SMB2 (sign_algo_id=1) signature for message?

Ah, wonderful! So I'm not crazy!

Good day, Michael, Jeremy, and other Samba list members. My name is Jeff Saxe,
and I'm an IT staff member at Quantitative Investment Management in
Charlottesville, Virginia, US. I have some more information to contribute on
this issue. I hope that this email adds on to Michael's previous email from
Feb. 8th 2023. I was actually reading your message through the list's
Archives web site, not having previously been subscribed to this list, and
I'm only subscribing now; so I cannot make my email client add that
"In-Reply-To" header that might assist the automatic threading.

We, too, have been struggling with this "Bad SMB2 signature for
message" random problem, except in our case, the clients that are
experiencing the issue are not Windows at all, but other Linux machines running
the CIFS-mount client in their kernels. The messages in our
/var/log/samba/log.CLIENT.IP.ADDRESS.HERE are exactly the same as what Michael
is quoting: "Bad SMB2 (sign_algo_id=1) signature for message",
mentioning that exact same line number 722 of smb2_signing.c. If I "tail
-F" that log file, I can see these same messages repeating every 2 seconds,
far more often than Michael sees from his Windows clients. So whatever the Linux
client is doing, it is extremely persistent in retrying, and it never succeeds
but also doesn't give up (doesn't time out or stop doing it for minutes
or even hours).

This is all Ubuntu 20.04 LTS, both the Samba server and the clients. The file
server currently has version 4.15 of Samba; "apt-cache policy" says it
could get version 4.11.6 from focal/main, but this was overridden by 4.15.13
from focal-security/main or focal-updates/main. We have about 10 client
machines, which don't have the full samba package installed (they don't
need to be CIFS servers, only clients), so they have the cifs-utils package,
version 6.9-1, whose userland CLI "mount.cifs -V" says it is
version 6.9. And their kernel, which I believe has the actual protocol
implementation, happens to be 5.4.0-153. These machines are used simultaneously
by several end users, and they all have "autofs" mounts that can mount
and unmount at any time, using Kerberos credentials so that each user ends up
getting access to shared files on the server under his or her security context.
So far all of this works just great, and the vast majority of the time, the end
users are very happy.

But occasionally, randomly (unfortunately I cannot recreate this issue on
demand), one of the users has a persistent failing share mount, such that from
their side, they see "Permission denied" and cannot list or
change-directory into the spot where it's mounted. The problem doesn't
seem to be with autofs itself, although I can't guarantee that; I think it
would still happen if the mounts were manual (a human typing mount.cifs at a
shell prompt) or were in /etc/fstab (mounting once every time the client machine
boots). And it does not affect all the users on the machine, nor does it affect
that same user mounting the same shares from other client machines! It seems to
be an isolated random flake, maybe twice a week. At any rate, I can log into the
client machine and, even if I sudo to root, I cannot list the directory either;
and curiously, we see some question marks for the permissions, owners (user and
group), and other metadata about the mounted directory. I will anonymize the
user and folder names below. The first is an unaffected, perfectly fine set of
shares for one user "bobby", and the second is a user "jack"
who is currently experiencing broken shares on 2 out of his 3 mounts.


/mnt/bobby:
total 64
drwx------ 2 bobby root     0 Aug  3 15:40 share1
drwx------ 2 bobby root 65536 Sep 28 09:20 share2
drwx------ 2 bobby root     0 Sep 29 15:36 share3

/mnt/jack:
ls: cannot access '/mnt/jack/share1': Permission denied
ls: cannot access '/mnt/jack/share3': Permission denied
total 64
d????????? ? ?           ?        ?            ? share1
drwx------ 2 jack        root 65536 Sep 28 09:20 share2
d????????? ? ?           ?        ?            ? share3


So the two shares "share1" and "share3" are mounted from
this Linux file server that is currently generating the message. The other share
"share2" happens to be mounted from a Windows-OS CIFS server elsewhere
in the same Active Directory domain. When the problem happens, only share1 and
share3 are affected; the end user has no problem accessing files from share2,
which consumes his exact same Kerberos credential cache file to do the mounting,
so it's not a general problem with the user's domain account, like
password locked or something.

Once this happens, the end user has no way to fix it himself. I can (as root)
"umount /mnt/jack/share1"; if jack happens to have a process stuck
with some I/O or a current directory within that share, then umount says
it's busy, but I can add the --lazy option and "umount -l" it, and
it goes ahead and unmounts. Once I've unmounted both share1 and
share3, jack can try again to go into the subdirectory, and autofs will
successfully mount it just fine, and he's fixed and can get back to work.
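
A client-side check for the symptom described above (entries that "ls" shows as "d?????????") could be sketched in Python as follows. This is only an illustration under stated assumptions: the function name broken_entries is hypothetical, and it treats any entry that cannot be stat'ed as broken, whatever the underlying error. Note that stat'ing an autofs-managed path may itself trigger a mount attempt.

```python
import os


def broken_entries(parent):
    """Return the names under `parent` that cannot be stat'ed.

    These are the entries that `ls -l` would render as `d?????????`.
    Hypothetical helper: it does not distinguish EACCES from other errors.
    """
    broken = []
    for name in sorted(os.listdir(parent)):
        try:
            os.stat(os.path.join(parent, name))
        except OSError:
            # PermissionError, FileNotFoundError, etc. are all OSError subclasses
            broken.append(name)
    return broken
```

Calling broken_entries("/mnt/jack") in the situation shown earlier would be expected to report share1 and share3; root could then run "umount -l" on each of them.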

However, I have also been trying to work out whether this whole thing is a client or
server problem, and I noticed today that I see something unusual on the server
end in the output of "smbstatus", specifically the top section
"smbstatus --processes". The normal-looking output would be something
like...

1152377 bobby        domain users 10.1.0.108 (ipv4:10.1.0.108:54124)  SMB3_11  -  partial(AES-128-CMAC)
1152377 jack         domain users 10.1.0.108 (ipv4:10.1.0.108:54124)  SMB3_11  -  partial(AES-128-CMAC)
1152377 jack         domain users 10.1.0.108 (ipv4:10.1.0.108:54124)  SMB3_11  -  partial(AES-128-CMAC)
1152383 mallory      domain users 10.1.0.117 (ipv4:10.1.0.117:59624)  SMB3_11  -  partial(AES-128-CMAC)
1152383 alice        domain users 10.1.0.117 (ipv4:10.1.0.117:59624)  SMB3_11  -  partial(AES-128-CMAC)
1152383 richard      domain users 10.1.0.117 (ipv4:10.1.0.117:59624)  SMB3_11  -  partial(AES-128-CMAC)
1152383 richard      domain users 10.1.0.117 (ipv4:10.1.0.117:59624)  SMB3_11  -  partial(AES-128-CMAC)


...with each line showing what protocol and what signing algorithm is currently
in effect between that client and this server. But when the problem is
happening, perhaps the protocol is stuck while negotiating the SMB2 signing,
because that last field just has a single hyphen in place of
"partial(AES-128-CMAC)". So I have taken this as a useful indicator that the issue
is happening, and in fact, if I grab the PID from the beginning of the line that
is showing this lack-of-a-signing-algorithm, and I do a "kill 1152383"
at the shell prompt (as root, no particular kill signal so I guess it is
SIGTERM), then the problem appears to be instantly cleared. The "tail
-F" of the client-specific Samba log file stops scrolling, and the question
marks and Permission denied on the client go away, and the client successfully
mounts or remounts the share, and the end user is happy again.
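
The indicator described above (a lone hyphen in the last column of "smbstatus --processes") can be picked out with a few lines of Python. This is a rough sketch, not the actual script: the function name pids_without_signing is hypothetical, it assumes each session occupies a single output line with the PID first and the signing field last, and the second sample row is constructed from the description rather than copied from a real log.

```python
def pids_without_signing(smbstatus_text):
    """Return the set of PIDs whose last (signing) column is a bare hyphen."""
    pids = set()
    for line in smbstatus_text.splitlines():
        fields = line.split()
        # Skip blank lines, the column header, and the dashed separator:
        # only session rows start with a numeric PID.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        if fields[-1] == "-":
            pids.add(int(fields[0]))
    return pids


# Illustrative input; the second row is a made-up example of the bad state.
SAMPLE = """\
1152377 bobby    domain users 10.1.0.108 (ipv4:10.1.0.108:54124)  SMB3_11  -  partial(AES-128-CMAC)
1152383 richard  domain users 10.1.0.117 (ipv4:10.1.0.117:59624)  SMB3_11  -  -
"""
```

With real data, the text would come from something like subprocess.run(["smbstatus", "--processes"], capture_output=True, text=True).stdout, run as root on the server.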

So just this morning I made a hackish Python script, run as a cron job every few
minutes, that greps the output of smbstatus for those suspicious lines and
kills off the malfunctioning PID (after waiting for 10 seconds and repeating the
smbstatus command, to make sure it's an actual case and not just a race
condition with a brand-new connection). I have some hope that this ridiculous
hack will work around the problem and keep it away from my end users while
hopefully someone can track down the root cause. Is there anything I can do
(config files, logging, tcpdump captures, etc.) to help you find the actual
cause? The file server is quite busy with high-bandwidth legitimate traffic
every day, so running a continuous tcpdump writing to disk all the time would be
very painful, especially because I can't recreate the problem at will.
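
The check-wait-recheck-kill logic of such a cron job could look roughly like the sketch below. It is not the script mentioned above, only an assumed reconstruction of the idea: sample is a hypothetical callable that returns the PIDs currently flagged as suspicious, and the sleep and kill parameters exist purely so the logic can be exercised without touching real processes.

```python
import os
import signal
import time


def confirmed_stuck(sample, delay=10.0, sleep=time.sleep):
    """Return PIDs flagged by two samples taken `delay` seconds apart.

    Requiring both samples to agree filters out brand-new connections that
    simply had not finished session setup at the first look.
    """
    first = set(sample())
    if not first:
        return set()
    sleep(delay)
    return first & set(sample())


def terminate(pids, kill=os.kill):
    """Send SIGTERM (the default signal of the `kill` command) to each PID."""
    for pid in sorted(pids):
        try:
            kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the process already exited on its own
```

A cron-driven wrapper would call terminate(confirmed_stuck(sample)) as root. Note that in the smbstatus output above several users share one PID, so killing it affects every session on that connection, not just the stuck one.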

Thanks very much, Jeremy and anyone else reading. I hope we can work together to
find and smash this bug permanently. Let me know how I can help.


-- Jeff Saxe

Jeff.Saxe at Quantitative.com