On Fri, Jun 04, 2004 at 01:20:54PM -0400, Joey Hess wrote:
> My colocated server was refusing both ssh and ssl telnet connections.
> It looked like this:
>
> joey:~>ssh -v kite
> OpenSSH_3.8.1p1 Debian 1:3.8.1p1-4, OpenSSL 0.9.7d 17 Mar 2004
> debug1: Reading configuration data /home/joey/.ssh/config
> debug1: Applying options for kite
> debug1: Reading configuration data /etc/ssh/ssh_config
> debug1: Connecting to kite [64.62.161.42] port 22.
> debug1: Connection established.
> debug1: identity file /home/joey/.ssh/identity type -1
> debug1: identity file /home/joey/.ssh/id_rsa type -1
> debug1: identity file /home/joey/.ssh/id_dsa type 2
> ssh_exchange_identification: Connection closed by remote host
>
> Telnet also hung up before I got to a login prompt. The rest of the services
> seemed ok. I got a root shell via other means, and tried restarting ssh. No
> luck. Tried upgrading the whole system to current unstable, again, no luck.
> Then I noticed something strange in ps:
>
> 14515 ? S 0:00 sshd: joey [pam]
> 32215 ? S 0:00 sshd: bdragon [pam]
> 8978 ? S 0:00 sshd: joeyh [pam]
>
> There were a few more that I've elided because they may contain privileged
> information. I don't have a "bdragon" or "joeyh" user, and there were some
> other weird users listed. None of these users were really logged in,
> that I could tell.

We're also seeing these symptoms on a server at work, although they're
highly intermittent and very difficult to track down. Debian ssh
3.8.1p1-4 is basically OpenSSH 3.8.1p1 plus Darren Tucker's auth-pam.c
patch to kill the PAM thread if the privsep slave dies plus a few other
changes which I'm pretty sure are unrelated. In all cases where it goes
wrong, the [pam] processes are left lying around either after attempting
to log in as a nonexistent user or Ctrl-Cing ssh at a Password: prompt.
We're running with UsePrivilegeSeparation yes, UsePAM yes, and
PasswordAuthentication no.

We noticed this at the end of a diff of auth.log output between a run where
the [pam] processes were left lying around and a run where they weren't:

  debug3: ssh_msg_send: type 1
  debug3: ssh_msg_recv entering
  debug3: mm_request_send entering: type 51
  debug3: mm_request_receive entering
- debug1: do_cleanup
  fatal: PAM: authentication thread exited unexpectedly
  debug1: do_cleanup
+ debug1: PAM: cleanup
+ debug3: PAM: sshpam_thread_cleanup entering

It looks to me as if sshpam_cleanup() and sshpam_thread_cleanup() aren't
getting called under all circumstances when they should be, and that the
result of this is that the [pam] threads lie around forever until they
choke the server. Yet do_cleanup() *is* getting called. Since I believe
that neither KRB5 nor GSSAPI is compiled in, this means that either:

  (a) we're in the login shell child (should certainly hope not,
      authentication fails)
  (b) do_cleanup() has been called already in this process
  (c) authctxt is NULL (which I don't think can be possible, since
      do_cleanup() must be getting called from cleanup_exit())

So I think I see the problem: if do_cleanup() happens to get called from
the "wrong" thread (perhaps the authentication thread itself?), then it
doesn't manage to do all the cleanup but nevertheless sets called to 1,
and when the main thread comes along later it doesn't do the PAM
cleanup.
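
To make the hypothesis concrete, here is a minimal single-threaded sketch in
C of the failure mode I have in mind. It is not the OpenSSH source: every
name in it (model_do_cleanup, model_pam_cleanup, run_scenario, and the
in_main_context flag) is hypothetical, and the only thing it borrows from the
real code is the idea of a run-once "called" guard. It assumes the guard is
set before the routine knows whether it can actually finish the PAM cleanup.

```c
#include <stdio.h>

/* Hypothetical counters standing in for real side effects. */
static int pam_cleanups_done = 0;
static int cleanup_called = 0;  /* models the run-once "called" guard */

/* Hypothetical stand-in for the PAM cleanup step. */
static void model_pam_cleanup(void)
{
    pam_cleanups_done++;
}

/*
 * Simplified model of a run-once cleanup routine.
 * in_main_context is an assumption of this sketch: nonzero means the
 * caller is in a position to tear down the PAM state.
 */
static void model_do_cleanup(int in_main_context)
{
    printf("do_cleanup\n");

    if (cleanup_called)
        return;             /* "already done", even if nothing was done */
    cleanup_called = 1;     /* guard set before the real work happens */

    if (!in_main_context)
        return;             /* wrong caller: bails out, guard stays set */

    model_pam_cleanup();
}

/* Reset the model so scenarios can be compared. */
static void model_reset(void)
{
    pam_cleanups_done = 0;
    cleanup_called = 0;
}

/*
 * Run two cleanup calls: the first from the given context, the second
 * always from the main context. Returns how many PAM cleanups ran.
 */
static int run_scenario(int first_caller_is_main)
{
    model_reset();
    model_do_cleanup(first_caller_is_main);
    model_do_cleanup(1);
    return pam_cleanups_done;
}
```

In this model, run_scenario(1) performs the PAM cleanup once, while
run_scenario(0) performs none: the first call sets the guard without cleaning
up, and the later call from the main context returns early. If something
like this is happening in sshd, it would also explain why do_cleanup is
logged without the PAM cleanup messages following it.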
I wish I could provide you with a reliable reproduction recipe, but
perhaps this is good enough for a pthreads expert on openssh-unix-dev to
work it out?
Thanks,
--
Colin Watson [cjwatson at flatline.org.uk]