David Batterham
2001-Oct-10 03:13 UTC
OpenSSH solaris: bad return code after exec of remote command
Hi OpenSSH developers, I am using openSSH (now 2.9.9p2, but prob occurs in 2.9p2 also) to execute commands on a remote machine which outputs data to stdout then pipes it to another invocation of ssh which connects back to the first machine in the same way, where it starts a program to read and store the output from the command on the second machine. I am using the "command" option in the keys file to force execution of a particular command when that key is used for authentication. My problem is this. The remote ssh returns an error code of 255 (using echo $?), or (-1 in the debug), despite the command executing successfully. about 5-10% of the time it returns 0. In all cases it "appears" the command completed successully, and is simply an issue of ssh failing to close the channel cleanly. My guess is that one side is closing before the other acknowledges and therefore thinks the channel has been prematurely terminated. My questions are: How sure can I be that I am indeed getting all data from the pipe? How can I accurately determine the exit status of the command (ufsdump) which is piping data into ssh, regardless of ssh's exit status ? Has anyone else seen this problem and/or resolved it ? If more debug would be helpful, please let me know... Below is the debug from the remote side when it fails: <snip> debug1: Remote: Forced command: /les/galaxy/sysadmin/backup/sbin/receiver debug1: input_userauth_pk_ok: pkalg ssh-dss blen 435 lastkey 12bac0 hint 0 debug1: read PEM private key done: type DSA debug1: sig size 20 20 debug1: Remote: Forced command: /les/galaxy/sysadmin/backup/sbin/receiver debug1: ssh-userauth2 successful: method publickey debug1: fd 5 setting O_NONBLOCK debug1: fd 7 setting O_NONBLOCK debug1: fd 8 setting O_NONBLOCK debug1: channel 0: new [client-session] debug1: channel_new: 0 debug1: send channel open 0 debug1: Entering interactive session. debug1: client_init id 0 arg 0 debug1: channel request 0: shell debug1: channel 0: open confirm rwindow 0 rmax 16384 debug1: channel 0: read<=0 rfd 5 len 0 debug1: channel 0: read failed debug1: channel 0: input open -> drain debug1: channel 0: close_read debug1: channel 0: input: no drain shortcut debug1: channel 0: ibuf empty debug1: channel 0: input drain -> closed debug1: channel 0: send eof debug1: channel 0: rcvd eof debug1: channel 0: output open -> drain debug1: channel 0: rcvd close debug1: channel 0: obuf empty debug1: channel 0: output drain -> closed debug1: channel 0: close_write debug1: channel 0: send close debug1: channel 0: is dead debug1: channel_free: channel 0: status: The following connections are open: #0 client-session (t4 r0 i8/0 o128/0 fd -1/-1) debug1: channel_free: channel 0: dettaching channel user debug1: Transferred: stdin 0, stdout 0, stderr 0 bytes in 15.3 seconds debug1: Bytes per second: stdin 0.0, stdout 0.0, stderr 0.0 debug1: Exit status -1 This is the debug when it works: OpenSSH_2.9p2, SSH protocols 1.5/2.0, OpenSSL 0x0090602f debug1: Reading configuration data /opt/OBSDssh/etc/ssh_config debug1: Seeding random number generator debug1: Rhosts Authentication disabled, originating port will not be trusted. debug1: restore_uid debug1: ssh_connect: getuid 0 geteuid 0 anon 1 debug1: Connecting to galaxy [my_ip_was_here] port 22. debug1: temporarily_use_uid: 0/1 (e=0) debug1: restore_uid debug1: temporarily_use_uid: 0/1 (e=0) debug1: restore_uid debug1: Connection established. debug1: read PEM private key done: type DSA debug1: read PEM private key done: type RSA debug1: identity file /.ssh/backup-sender type 2 debug1: Remote protocol version 2.0, remote software version OpenSSH_2.9p2 debug1: match: OpenSSH_2.9p2 pat ^OpenSSH Enabling compatibility mode for protocol 2.0 debug1: Local version string SSH-2.0-OpenSSH_2.9p2 debug1: SSH2_MSG_KEXINIT sent debug1: SSH2_MSG_KEXINIT received debug1: kex: server->client aes128-cbc hmac-md5 none debug1: kex: client->server aes128-cbc hmac-md5 none debug1: SSH2_MSG_KEX_DH_GEX_REQUEST sent debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP debug1: dh_gen_key: priv key bits set: 135/256 debug1: bits set: 1017/2049 debug1: SSH2_MSG_KEX_DH_GEX_INIT sent debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY debug1: Host 'galaxy' is known and matches the RSA host key. debug1: Found key in //.ssh/known_hosts2:1 debug1: bits set: 1027/2049 debug1: ssh_rsa_verify: signature correct debug1: kex_derive_keys debug1: newkeys: mode 1 debug1: SSH2_MSG_NEWKEYS sent debug1: waiting for SSH2_MSG_NEWKEYS debug1: newkeys: mode 0 debug1: SSH2_MSG_NEWKEYS received debug1: done: ssh_kex2. debug1: send SSH2_MSG_SERVICE_REQUEST debug1: service_accept: ssh-userauth debug1: got SSH2_MSG_SERVICE_ACCEPT debug1: authentications that can continue: publickey,password,keyboard-interactive debug1: next auth method to try is publickey debug1: try pubkey: /.ssh/backup-sender debug1: Remote: Forced command: /les/galaxy/sysadmin/backup/sbin/receiver debug1: input_userauth_pk_ok: pkalg ssh-dss blen 435 lastkey 12bac0 hint 0 debug1: read PEM private key done: type DSA debug1: sig size 20 20 debug1: Remote: Forced command: /les/galaxy/sysadmin/backup/sbin/receiver debug1: ssh-userauth2 successful: method publickey debug1: fd 5 setting O_NONBLOCK debug1: fd 7 setting O_NONBLOCK debug1: fd 8 setting O_NONBLOCK debug1: channel 0: new [client-session] debug1: channel_new: 0 debug1: send channel open 0 debug1: Entering interactive session. debug1: client_init id 0 arg 0 debug1: channel request 0: shell debug1: channel 0: open confirm rwindow 0 rmax 16384 debug1: channel 0: read<=0 rfd 5 len 0 debug1: channel 0: read failed debug1: channel 0: input open -> drain debug1: channel 0: close_read debug1: channel 0: input: no drain shortcut debug1: channel 0: ibuf empty debug1: channel 0: input drain -> closed debug1: channel 0: send eof debug1: client_input_channel_req: channel 0 rtype exit-status reply 0 debug1: channel 0: rcvd eof debug1: channel 0: output open -> drain debug1: channel 0: rcvd close debug1: channel 0: obuf empty debug1: channel 0: output drain -> closed debug1: channel 0: close_write debug1: channel 0: send close debug1: channel 0: is dead debug1: channel_free: channel 0: status: The following connections are open: #0 client-session (t4 r0 i8/0 o128/0 fd -1/-1) debug1: channel_free: channel 0: dettaching channel user debug1: Transferred: stdin 0, stdout 0, stderr 0 bytes in 16.3 seconds debug1: Bytes per second: stdin 0.0, stdout 0.0, stderr 0.0 debug1: Exit status 0 Thanks, Dave -- --------------------------------------------------------------- David Batterham PHONE: +61 3 95001017 System Administrator FAX: +61 3 95001191 Lochard Pty Ltd EMAIL: david at lochard.com.au
Matthew Seaman
2001-Oct-10 10:09 UTC
OpenSSH solaris: bad return code after exec of remote command
David Batterham wrote:> I am using openSSH (now 2.9.9p2, but prob occurs in 2.9p2 also) to execute > commands on a remote machine which outputs data to stdout then pipes it to > another invocation of ssh which connects back to the first machine in the same > way, where it starts a program to read and store the output from the command on > the second machine. I am using the "command" option in the keys file to force > execution of a particular command when that key is used for authentication. > > My problem is this. The remote ssh returns an error code of 255 (using echo $?), > or (-1 in the debug), despite the command executing successfully. about 5-10% of > the time it returns 0. In all cases it "appears" the command completed > successully, and is simply an issue of ssh failing to close the channel cleanly. > My guess is that one side is closing before the other acknowledges and therefore > thinks the channel has been prematurely terminated. > > My questions are: How sure can I be that I am indeed getting all data from the > pipe? How can I accurately determine the exit status of the command (ufsdump) > which is piping data into ssh, regardless of ssh's exit status ? Has anyone else > seen this problem and/or resolved it ?Damn. Synchronicity or what. We've just been struggling with a similar problem and I was just about to post a bug report when I saw this message. What we've been able to determine: You'll always get a failure exit code from ssh if the following three things are true: i) SSH2 protocol ii) Using a forced command iii) No stdin on the client --- ie. using `ssh -n' or `ssh host commands </dev/null'. Work around is to use rsa1 keys and SSH1.5 protocol. Verified using OpenSSH 2.3.0p1, 2.9p1, 2.9.9p2 on Solaris 8. We'll be glad to supply more detailed debugging output or whatever on request. Matthew -- Matthew Seaman 01628 498661 Abeo, abeo, abeo, actum est, comites!
Aaron Bush
2001-Oct-10 14:19 UTC
OpenSSH solaris: bad return code after exec of remote command
David Batterham wrote:> > My problem is this. The remote ssh returns an error code of 255 (using echo $?), > or (-1 in the debug), despite the command executing successfully. about 5-10% of > the time it returns 0. In all cases it "appears" the command completed > successully, and is simply an issue of ssh failing to close the channel cleanly. > My guess is that one side is closing before the other acknowledges and therefore > thinks the channel has been prematurely terminated.I too have had this problem. Using 2.9p2 a call such as: $ ssh host "rm file" will sometimes return 255 and other times return 0. when the return value is not 0 i ssh back to host and confirm that the file is 'really' removed, and in all cases the file had been removed. -ab
it's easy to reproduce your problem: ssh -n localhost 'echo x; exec > /dev/null 2>&1; sleep 3; exit 5;' will always have an exit status of 255 (the default for the client) instead of 5 because the sshd sends a CHANNEL CLOSE message (since all pipes are closed) before the shell exits. so a "exit-status" messages cannot be sent because the channel is gone at this point. i'm not sure how to delay the CLOSE message.
Paul Menage
2001-Oct-10 19:56 UTC
OpenSSH solaris: bad return code after exec of remote command
> My problem is this. The remote ssh returns an error code of 255 (using >echo $?), or (-1 in the debug), despite the command executing >successfully. about 5-10% of the time it returns 0. In all cases it >"appears" the command completed successully, and is simply an issue of >ssh failing to close the channel cleanly.We've seen this intermittently on Windows - we put it down to bugs in cygwin, as executing the same scripts (on the server) on Solaris or Linux has never produced the problem for us. Paul
Markus Friedl
2001-Oct-10 20:22 UTC
OpenSSH solaris: bad return code after exec of remote command
I wrote:> it's easy to reproduce your problem: > > ssh -n localhost 'echo x; exec > /dev/null 2>&1; sleep 3; exit 5;'could you please try this diff against the current CVS hope everything else stil works... -m Index: channels.c ==================================================================RCS file: /cvs/openssh_cvs/channels.c,v retrieving revision 1.110 diff -u -r1.110 channels.c --- channels.c 2001/10/10 05:14:38 1.110 +++ channels.c 2001/10/10 20:26:54 @@ -331,10 +331,6 @@ debug3("channel_free: status: %s", s); xfree(s); - if (c->detach_user != NULL) { - debug("channel_free: channel %d: detaching channel user", c->self); - c->detach_user(c->self, NULL); - } if (c->sock != -1) shutdown(c->sock, SHUT_RDWR); channel_close_fds(c); @@ -1520,6 +1516,30 @@ channel_handler_init_15(); } +/* gc dead channels */ +static void +channel_garbage_collect(Channel *c) +{ + if (c == NULL) + return; + debug("channel %d: gc called", c->self); + if (c->detach_user != NULL) { + if (!chan_is_dead(c, 0)) + return; + debug("channel %d: gc: notify user", c->self); + c->detach_user(c->self, NULL); + /* if we still have a callback */ + if (c->detach_user != NULL) + return; + debug("channel %d: gc: user removed callback", c->self); + } + debug("channel %d: gc: after notify user", c->self); + if (!chan_is_dead(c, 1)) + return; + debug("channel %d: garbage collecting", c->self); + channel_free(c); +} + static void channel_handler(chan_fn *ftab[], fd_set * readset, fd_set * writeset) { @@ -1537,24 +1557,7 @@ continue; if (ftab[c->type] != NULL) (*ftab[c->type])(c, readset, writeset); - if (chan_is_dead(c)) { - /* - * we have to remove the fd's from the select mask - * before the channels are free'd and the fd's are - * closed - */ - if (c->wfd != -1) - FD_CLR(c->wfd, writeset); - if (c->rfd != -1) - FD_CLR(c->rfd, readset); - if (c->efd != -1) { - if (c->extended_usage == CHAN_EXTENDED_READ) - FD_CLR(c->efd, readset); - if (c->extended_usage == CHAN_EXTENDED_WRITE) - FD_CLR(c->efd, writeset); - } - channel_free(c); - } + channel_garbage_collect(c); } } @@ -1625,7 +1628,7 @@ if (compat20 && (c->flags & (CHAN_CLOSE_SENT|CHAN_CLOSE_RCVD))) { /* XXX is this true? */ - debug2("channel %d: no data after CLOSE", c->self); + debug3("channel %d: will not send data after CLOSE", c->self); continue; } Index: channels.h ==================================================================RCS file: /cvs/openssh_cvs/channels.h,v retrieving revision 1.41 diff -u -r1.41 channels.h --- channels.h 2001/10/10 05:14:39 1.41 +++ channels.h 2001/10/10 20:26:54 @@ -214,7 +214,7 @@ /* channel close */ -int chan_is_dead(Channel *); +int chan_is_dead(Channel *, int); void chan_mark_dead(Channel *); void chan_init_iostates(Channel *); void chan_init(void); Index: clientloop.c ==================================================================RCS file: /cvs/openssh_cvs/clientloop.c,v retrieving revision 1.65 diff -u -r1.65 clientloop.c --- clientloop.c 2001/09/18 05:51:14 1.65 +++ clientloop.c 2001/10/10 20:26:57 @@ -753,6 +753,7 @@ if (id != session_ident) error("client_channel_closed: id %d != session_ident %d", id, session_ident); + channel_cancel_cleanup(id); session_closed = 1; if (in_raw_mode()) leave_raw_mode(); Index: nchan.c ==================================================================RCS file: /cvs/openssh_cvs/nchan.c,v retrieving revision 1.28 diff -u -r1.28 nchan.c --- nchan.c 2001/09/20 19:33:33 1.28 +++ nchan.c 2001/10/10 20:26:58 @@ -432,7 +432,7 @@ } int -chan_is_dead(Channel *c) +chan_is_dead(Channel *c, int send) { if (c->type == SSH_CHANNEL_ZOMBIE) { debug("channel %d: zombie", c->self); @@ -461,7 +461,14 @@ "read": "write"); } else { if (!(c->flags & CHAN_CLOSE_SENT)) { - chan_send_close2(c); + if (send) { + chan_send_close2(c); + } else { + if (c->flags & CHAN_CLOSE_RCVD) { + debug("channel %d: XXX almost dead", c->self); + return 1; + } + } } if ((c->flags & CHAN_CLOSE_SENT) && (c->flags & CHAN_CLOSE_RCVD)) { Index: serverloop.c ==================================================================RCS file: /cvs/openssh_cvs/serverloop.c,v retrieving revision 1.80 diff -u -r1.80 serverloop.c --- serverloop.c 2001/10/10 05:14:39 1.80 +++ serverloop.c 2001/10/10 20:26:59 @@ -208,9 +208,6 @@ max_time_milliseconds = options.client_alive_interval * 1000; } - /* When select fails we restart from here. */ -retry_select: - /* Allocate and update select() masks for channel descriptors. */ channel_prepare_select(readsetp, writesetp, maxfdp, nallocp, 0); @@ -275,10 +272,11 @@ ret = select((*maxfdp)+1, *readsetp, *writesetp, NULL, tvp); if (ret == -1) { + memset(*readsetp, 0, *maxfdp); + memset(*writesetp, 0, *maxfdp); if (errno != EINTR) error("select: %.100s", strerror(errno)); - else - goto retry_select; + debug("XXX select: %.100s", strerror(errno)); } if (ret == 0 && client_alive_scheduled) client_alive_check(); @@ -668,13 +666,30 @@ /* NOTREACHED */ } +static void +collect_children(void) +{ + pid_t pid; + sigset_t oset, nset; + int status; + + /* block SIGCHLD while we check for dead children */ + sigemptyset(&nset); + sigaddset(&nset, SIGCHLD); + sigprocmask(SIG_BLOCK, &nset, &oset); + if (child_terminated) { + while ((pid = waitpid(-1, &status, WNOHANG)) > 0) + session_close_by_pid(pid, status); + child_terminated = 0; + } + sigprocmask(SIG_SETMASK, &oset, NULL); +} + void server_loop2(Authctxt *authctxt) { fd_set *readset = NULL, *writeset = NULL; - int rekeying = 0, max_fd, status, nalloc = 0; - pid_t pid; - sigset_t oset, nset; + int rekeying = 0, max_fd, nalloc = 0; debug("Entering interactive session for SSH2."); @@ -698,16 +713,7 @@ wait_until_can_do_something(&readset, &writeset, &max_fd, &nalloc, 0); - /* block SIGCHLD while we check for dead children */ - sigemptyset(&nset); - sigaddset(&nset, SIGCHLD); - sigprocmask(SIG_BLOCK, &nset, &oset); - if (child_terminated) { - while ((pid = waitpid(-1, &status, WNOHANG)) > 0) - session_close_by_pid(pid, status); - child_terminated = 0; - } - sigprocmask(SIG_SETMASK, &oset, NULL); + collect_children(); if (!rekeying) channel_after_select(readset, writeset); process_input(readset); @@ -715,6 +721,8 @@ break; process_output(writeset); } + collect_children(); + if (readset) xfree(readset); if (writeset) @@ -722,11 +730,6 @@ /* free all channels, no more reads and writes */ channel_free_all(); - - /* collect remaining dead children, XXX not necessary? */ - mysignal(SIGCHLD, SIG_DFL); - while ((pid = waitpid(-1, &status, WNOHANG)) > 0) - session_close_by_pid(pid, status); /* close remaining sessions, e.g remove wtmp entries */ session_close_all(); Index: session.c ==================================================================RCS file: /cvs/openssh_cvs/session.c,v retrieving revision 1.152 diff -u -r1.152 session.c --- session.c 2001/10/10 05:14:39 1.152 +++ session.c 2001/10/10 20:27:03 @@ -1961,18 +1961,15 @@ debug("session_close_by_channel: no session for channel %d", id); return; } - /* disconnect channel */ - channel_cancel_cleanup(s->chanid); - s->chanid = -1; - - debug("session_close_by_channel: channel %d kill %d", id, s->pid); + debug("session_close_by_channel: channel %d child %d", id, s->pid); if (s->pid != 0) { - /* notify child */ - if (kill(s->pid, SIGHUP) < 0) - error("session_close_by_channel: kill %d: %s", - s->pid, strerror(errno)); + debug("session_close_by_channel: channel %d: has child", id); + return; } + channel_cancel_cleanup(s->chanid); + s->chanid = -1; session_close(s); + return; } void
Markus Friedl
2001-Oct-11 15:18 UTC
OpenSSH solaris: bad return code after exec of remote command
I wrote:> could you please try this diff against the current CVSso it works?