Under certain circumstances (reproducible, and avoidable with a workaround) the client in openssh-2.1.1p3 and p4 closes file descriptors and then calls select() with the stderr descriptor still in the write fd_set. The trigger appears to be that stdin/stdout/stderr are closed before the last of the stderr data has been written to stderr. It occurs when no tty is allocated, yet the error shows up on the client side. So perhaps it is the timing or order of the data coming from the server that triggers it.

This occurs on Solaris 7, Slackware 7.0, Slackware 3.4, and Redhat 6.0, with each of them used as either client or server in various combinations. In all cases protocol version 2 is configured.

Here is a simple example with Slackware 7.0 as both client and server:

    phil@procyon:/home/phil 1311> ssh izar 'ls this_file_does_not_exist'
    ls: select: Bad file descriptor
    phil@procyon:/home/phil 1312> ssh izar 'ls this_file_does_not_exist;sleep 1'
    ls: this_file_does_not_exist: No such file or directory
    phil@procyon:/home/phil 1313>

Another example with a Solaris 7 client and a Redhat 6.0 server:

    phil@sirius:/home/phil 57> ssh mira 'ls this_file_does_not_exist'
    ls: select: Bad file number
    phil@sirius:/home/phil 58> ssh mira 'ls this_file_does_not_exist;sleep 1'
    ls: this_file_does_not_exist: No such file or directory
    phil@sirius:/home/phil 59>

The problem also occurs when client and server are the same machine, so physical network timing is not expected to be the trigger:

    phil@procyon:/home/phil 1315> ssh procyon 'ls this_file_does_not_exist'
    ls: select: Bad file descriptor
    phil@procyon:/home/phil 1316>

I ran strace on ssh -v and found the following syscall events in the failing case:

    close(6)                                = 0
    select(7, [3], [3 6], NULL, NULL)       = -1 EBADF (Bad file descriptor)

Notice the 6 in the write fd_set (3rd argument), even though fd 6 was closed by the immediately preceding call.
The successful case (using the 1 second sleep) looked like:

    close(6)                                = 0
    select(7, [3], [3], NULL, NULL)         = 1 (out [3])

So regardless of any failings that may exist on the server side, the client is clearly doing the wrong thing at times when it builds the write fd_set for select(). I'm too unfamiliar with the organization of the code in clientloop.c (it jumps around among too many different functions for me to keep track of) to figure out exactly why this is happening. I can only see that it definitely is happening.

At first I thought the bug was on the server side, so I ran strace on sshd -d to see what was happening. There definitely is a difference in the sequence of events on the server side between the failing and successful cases. This may be triggering the problem on the client side, or it may just be a result of it; I don't know. Here's the documentation I have captured. The "failure" and "success" files are the failing and successful cases. The "combine" files interleave the failure and success cases, with each difference set aside as its own block of lines.

Server straces of sshd -d:

    http://phil.ipal.org/openssh/ssh-strace-servers-combine.txt
    http://phil.ipal.org/openssh/ssh-strace-servers-failure.txt
    http://phil.ipal.org/openssh/ssh-strace-servers-success.txt

Client straces of ssh -v:

    http://phil.ipal.org/openssh/ssh-strace-clients-combine.txt
    http://phil.ipal.org/openssh/ssh-strace-clients-failure.txt
    http://phil.ipal.org/openssh/ssh-strace-clients-success.txt

In the combine files, the indicator "S-" marks each line from the success case, and "-F" marks each line from the failure case. The blocks of differences are set apart with a row of 77 equal signs. The interesting parts are near the bottom of each file, but the whole trace is included to make sure all relevant information is there.

I hope someone who understands the organization of the client code can figure out the cause.
Since I'm in the USA I can't contribute back a patch even if I do find it.

Again, this is all protocol version 2, as I have both clients and servers configured for version 2 only, and all keys are DSA. If you have trouble reproducing it, note that it does not always occur; give it several tries. Another factor that may be involved is that my key has no passphrase (but I don't really expect this to be relevant).

-- 
| Phil Howard - KA9WGN |   My current websites: linuxhomepage.com, ham.org
| phil (at) ipal.net   +----------------------------------------------------
| Dallas - Texas - USA |   phil-evaluates-email-ads-750-dollars-each at ipal.net
