Damien Miller
2022-Jan-11 04:00 UTC
OS X poll breakage (Was: Please help test recent changes)
On Mon, 10 Jan 2022, Darren Tucker wrote:> On Fri, Jan 07, 2022 at 09:52:09AM +1100, Damien Miller wrote: > > 1. Conversion of the ssh and sshd mainloop from select() to poll() > > > > This should be entirely invisible to users, so any behaviour change > > is a bug. If you see something and want to help debug it further, > > uncomment the DEBUG_CHANNEL_POLL #define in channels.c for helps of > > extra debug logging. > > This change breaks (at least) the exit-status test on Mac OS X. I > haven't had a chance to investigate yet, > > $ git bisect run sh -c "autoreconf && ./configure --without-openssl && > make t-exec LTESTS=exit-status" > [... lots ...] > 17877bc81db3846e6e7d4cfb124d966bb9c9296b is the first bad commit > commit 17877bc81db3846e6e7d4cfb124d966bb9c9296b > Author: djm at openbsd.org <djm at openbsd.org> > Date: Thu Jan 6 21:48:38 2022 +0000 > > upstream: convert ssh, sshd mainloops from select() to poll(); > > feedback & ok deraadt@ and markus@ has been in snaps for a few months > > OpenBSD-Commit-ID: a77e16a667d5b194dcdb3b76308b8bba7fa7239c > > $ uname -a > Darwin osx-highsierra 17.7.0 Darwin Kernel Version 17.7.0: Fri Oct 30 > 13:34:27 PDT 2020; root:xnu-4570.71.82.8~1/RELEASE_X86_64 x86_64Here's the client side log of the failure. It comes from the "same with early close of stdout/err" section of the test, but I can't actually see anything get closed... debug3: receive packet: type 91 debug2: channel_input_open_confirmation: channel 0: callback start debug2: client_session2_setup: id 0 debug1: Sending command: exec sh -c 'sleep 2; exec > /dev/null 2>&1; sleep 3; exit 0' debug2: channel 0: request exec confirm 1 debug3: send packet: type 98 debug2: channel_input_open_confirmation: channel 0: callback done debug2: channel 0: open confirm rwindow 0 rmax 32768 debug2: channel 0: rcvd adjust 2097152 debug3: receive packet: type 99 debug2: channel_input_status_confirm: type 99 id 0 debug2: exec request accepted on channel 0 channel 0: invalid rfd pollfd[2].fd 4 r4 w7 e8 s-1 and (separate run) with DEBUG_CHANNEL_POLL set: debug2: channel_input_status_confirm: type 99 id 0 debug2: exec request accepted on channel 0 debug3: dump_channel_poll: channel 0: rfd r4 w7 e8 s-1 pfd[2].fd=4 want 0x01 ev 0x01 ready 0x00 rev 0x00 debug3: dump_channel_poll: channel 0: rfd r4 w7 e8 s-1 pfd[3].fd=7 want 0x01 ev 0x00 ready 0x00 rev 0x00 debug3: dump_channel_poll: channel 0: rfd r4 w7 e8 s-1 pfd[4].fd=8 want 0x01 ev 0x00 ready 0x00 rev 0x00 debug1: channel_after_poll: pfd[2].fd 4 rev 0x0020 debug3: dump_channel_poll: channel 0: rfd r4 w7 e8 s-1 pfd[2].fd=4 want 0x01 ev 0x01 ready 0x00 rev 0x20 channel 0: invalid rfd pollfd[2].fd 4 r4 w7 e8 s-1 FAIL: exit code (with sleep) mismatch for: 255 != 0 It looks like it fails with HAVE_POLL set to 0 too. Still looking... -d
Damien Miller
2022-Jan-11 07:30 UTC
OS X poll breakage (Was: Please help test recent changes)
On Tue, 11 Jan 2022, Damien Miller wrote:> Here's the client side log of the failure. It comes from the > "same with early close of stdout/err" section of the test, but I can't > actually see anything get closed... > > debug3: receive packet: type 91 > debug2: channel_input_open_confirmation: channel 0: callback start > debug2: client_session2_setup: id 0 > debug1: Sending command: exec sh -c 'sleep 2; exec > /dev/null 2>&1; sleep 3; exit 0' > debug2: channel 0: request exec confirm 1 > debug3: send packet: type 98 > debug2: channel_input_open_confirmation: channel 0: callback done > debug2: channel 0: open confirm rwindow 0 rmax 32768 > debug2: channel 0: rcvd adjust 2097152 > debug3: receive packet: type 99 > debug2: channel_input_status_confirm: type 99 id 0 > debug2: exec request accepted on channel 0 > channel 0: invalid rfd pollfd[2].fd 4 r4 w7 e8 s-1 > > and (separate run) with DEBUG_CHANNEL_POLL set: > > debug2: channel_input_status_confirm: type 99 id 0 > debug2: exec request accepted on channel 0 > debug3: dump_channel_poll: channel 0: rfd r4 w7 e8 s-1 pfd[2].fd=4 want 0x01 ev 0x01 ready 0x00 rev 0x00 > debug3: dump_channel_poll: channel 0: rfd r4 w7 e8 s-1 pfd[3].fd=7 want 0x01 ev 0x00 ready 0x00 rev 0x00 > debug3: dump_channel_poll: channel 0: rfd r4 w7 e8 s-1 pfd[4].fd=8 want 0x01 ev 0x00 ready 0x00 rev 0x00 > debug1: channel_after_poll: pfd[2].fd 4 rev 0x0020 > debug3: dump_channel_poll: channel 0: rfd r4 w7 e8 s-1 pfd[2].fd=4 want 0x01 ev 0x01 ready 0x00 rev 0x20 > channel 0: invalid rfd pollfd[2].fd 4 r4 w7 e8 s-1 > FAIL: exit code (with sleep) mismatch for: 255 != 0 > > It looks like it fails with HAVE_POLL set to 0 too. > > Still looking...Wow, it looks like Darwin's poll(2) is completely broken for character- special devices (at least). E.g. the attached program spins shows similar behaviour when run on /dev/null - it spins, returning revents=POLLNVAL. It looks like I'm not the first to see this either, some people noticed it 17 years ago! https://lists.apple.com/archives/darwin-dev/2005/May/msg00220.html There's apparently a bug open (Apple bug 3710161), but I can't see it and if they haven't fixed it by now then they're presumably not in any great hurry. Unsetting HAVE_POLL lets the test pass. It seems like some other programs use a similar approach, e.g. https://www.mail-archive.com/bug-gnulib at gnu.org/msg00296.html So I think we need a HAVE_BROKEN_POLL :( -d