Jordan Ritter
2010-Oct-02 16:38 UTC
Problem with binding UNIX listeners before checking PID
Howdy.

I have lately been frustrated by the following use case:

  1. Run nginx/unicorn in production, listening on a UNIX socket
     with a defined pid file. Things run good.
  2. Someone pushes code, unicorn restarts just fine, workers are
     all up and running.
  3. But someone is suspicious, or maybe they forget which
     box they're logged into, so they invoke unicorn manually.
     Same directory, same settings.
  4. It looks like the pid file check kicked in, because unicorn
     refuses to boot - hey, it's already running, bugger off. great.
  5. BUT, this happened *after* the listener processing: the
     manually-invoked unicorn unlinks the real unicorn master's unix
     listener, so it's left dead in the water and everybody loses.

unicorn master doesn't know its listener is actually gone (but lsof shows
open unix socket fd, netstat shows unix socket still present, so cursory
investigation is misleading), but nginx keeps spewing ECONNREFUSEDs
because the unix socket it's hitting belongs to that accidental unicorn
instance that already decided not to stick around.

I think this is effectively about a behavioral difference in
Unicorn::SocketHelper#bind_listen around the handling of UNIX vs. TCP
sockets (this doesn't happen with TCP sockets because there's no
unlink/disconnect step), and the fact that HttpServer#start evaluates
the listener config before the PID path/config.

Now I see comments in and around HttpServer#initialize talking about races
wrt binding to the listener and whatnot, and being newish to the codebase
I admit I haven't yet fully absorbed all the considerations at play.

But I think it's fair to say that killing the listener(s) (in the UNIX
socket case) before discovering you shouldn't have run in the first place
(from the PID file) qualifies as buggy/bad/broken behavior.

I might suggest simply swapping their processing order in #start, but
given the complexity of in-place restarts and other race considerations,
I have doubts solving this would be that easy.

Any thoughts/ideas?

cheers,
--jordan
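
For anyone who wants to reproduce the failure mode in steps 4 and 5
without a full nginx/unicorn setup, here is a minimal Ruby sketch. It is
not unicorn code: the method names, the "impostor" listener and the
temporary socket path are all made up for illustration. It assumes a
Linux-like system where connecting to a socket file with no listener
behind it fails with ECONNREFUSED.

```ruby
require "socket"
require "tmpdir"

# Hypothetical helper: try to connect to a UNIX socket path and report
# either :connected or the exception class that connect(2) produced.
def try_connect(path)
  UNIXSocket.new(path).close
  :connected
rescue SystemCallError => e
  e.class
end

# Hypothetical demo of an orphaned listener. Returns
# [result_before, result_after, original_listener_still_open].
def demo_orphaned_listener
  Dir.mktmpdir do |dir|
    path = File.join(dir, "demo.sock") # illustrative path only
    original = UNIXServer.new(path)    # stands in for the real master
    before = try_connect(path)

    # What the accidental second instance effectively did:
    # unlink the path, bind a fresh socket there, then give up and exit.
    File.unlink(path)
    impostor = UNIXServer.new(path)
    impostor.close                     # the path stays behind on disk

    after = try_connect(path)
    still_open = !original.closed?     # the fd is open but unreachable by path

    original.close
    File.unlink(path) if File.exist?(path)
    [before, after, still_open]
  end
end

if __FILE__ == $PROGRAM_NAME
  p demo_orphaned_listener
end
```

The original listener's descriptor stays open the whole time, which is
why lsof looks healthy even though nothing can reach it through the
path any more.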
Jordan Ritter <jpr5 at darkridge.com> wrote:
> Howdy.
>
> I have lately been frustrated by the following use case:
>
>   1. Run nginx/unicorn in production, listening on a UNIX socket
>      with a defined pid file. Things run good.
>   2. Someone pushes code, unicorn restarts just fine, workers are
>      all up and running.
>   3. But someone is suspicious, or maybe they forget which
>      box they're logged into, so they invoke unicorn manually.
>      Same directory, same settings.
>
>   4. It looks like the pid file check kicked in, because unicorn
>      refuses to boot - hey, it's already running, bugger off. great.
>   5. BUT, this happened *after* the listener processing: the
>      manually-invoked unicorn unlinks the real unicorn master's unix
>      listener, so it's left dead in the water and everybody loses.
>
> unicorn master doesn't know its listener is actually gone (but lsof shows
> open unix socket fd, netstat shows unix socket still present, so cursory
> investigation is misleading), but nginx keeps spewing ECONNREFUSEDs
> because the unix socket it's hitting belongs to that accidental unicorn
> instance that already decided not to stick around.
>
> I think this is effectively about a behavioral difference in
> Unicorn::SocketHelper#bind_listen around the handling of UNIX vs. TCP
> sockets (this doesn't happen with TCP sockets because there's no
> unlink/disconnect step), and the fact that HttpServer#start evaluates
> the listener config before the PID path/config.
>
> Now I see comments in and around HttpServer#initialize talking about races
> wrt binding to the listener and whatnot, and being newish to the codebase
> I admit I haven't yet fully absorbed all the considerations at play.
>
> But I think it's fair to say that killing the listener(s) (in the UNIX
> socket case) before discovering you shouldn't have run in the first place
> (from the PID file) qualifies as buggy/bad/broken behavior.

Hi Jordan,

Thanks for the detailed bug report.
I knew from experience with other daemons that lingering UNIX sockets
caused troubles for some users, but I failed to take into account the
case where a user mistakenly starts the process twice.

Yes, getting pid file writing/ordering "right"[1] is very tricky.

> I might suggest simply swapping their processing order in #start, but
> given the complexity of in-place restarts and other race considerations,
> I have doubts solving this would be that easy.

That wouldn't work if pid files weren't in use at all.

> Any thoughts/ideas?

A simpler check would be to use connect(2) (but not make any HTTP
request) to see if the socket is alive.  Patch coming.

[1] - I don't believe there actually is a way to always be right,
      just less bad/broken than the alternatives.

-- 
Eric Wong
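
As a rough illustration of the connect(2) idea above, the sketch below
classifies a UNIX socket path by attempting a bare connection and never
sending a request. The helper name unix_socket_state and its return
symbols are hypothetical and do not exist in unicorn. The check is also
inherently racy: another process can bind or unlink the path between
the probe and whatever the caller does next.

```ruby
require "socket"
require "tmpdir"

# Hypothetical helper (not part of unicorn): report the state of a
# UNIX domain socket path without sending any data over it.
#   :missing      - nothing exists at the path
#   :not_a_socket - something exists, but it is not a socket
#   :in_use       - a process accepted the connection at connect(2) level
#   :stale        - the socket file exists but nobody is listening
def unix_socket_state(path)
  return :missing unless File.exist?(path)
  return :not_a_socket unless File.socket?(path)

  UNIXSocket.new(path).close # connect only, no HTTP request
  :in_use
rescue Errno::ECONNREFUSED
  :stale
rescue Errno::ENOENT
  :missing # the path vanished between the checks and the connect
end

if __FILE__ == $PROGRAM_NAME
  Dir.mktmpdir do |dir|
    path = File.join(dir, "probe.sock") # illustrative path only
    server = UNIXServer.new(path)
    puts unix_socket_state(path)        # a live listener is present here
    server.close
    puts unix_socket_state(path)        # the leftover file has no listener
  end
end
```

Only the :stale case would be a candidate for unlinking. For :in_use
the caller can simply attempt bind(2) and let it fail with EADDRINUSE.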
While we've always unlinked dead sockets from nuked/leftover
processes, blindly unlinking them can cause unnecessary failures
when an active process is already listening on them.  We now
make a simple connect(2) check to ensure the socket is not in
use before unlinking it.

Thanks to Jordan Ritter for the detailed bug report leading to
this fix.

ref: http://mid.gmane.org/8D95A44B-A098-43BE-B532-7D74BD957F31 at darkridge.com
---
  Eric Wong <normalperson at yhbt.net> wrote:
  > A simpler check would be to use connect(2) (but not make any HTTP request)
  > to see if the socket is alive.  Patch coming.

  s/simpler/better/

  Also pushed out to "master".  I guess a 1.1.4 release with this fix only
  is on the way since there isn't much else to release, yet.

 lib/unicorn/socket_helper.rb    |   10 ++++-
 t/t0011-active-unix-socket.sh   |   79 +++++++++++++++++++++++++++++++++++++++
 test/unit/test_socket_helper.rb |    9 ++++-
 3 files changed, 95 insertions(+), 3 deletions(-)
 create mode 100644 t/t0011-active-unix-socket.sh

diff --git a/lib/unicorn/socket_helper.rb b/lib/unicorn/socket_helper.rb
index 9a155e1..1d03eab 100644
--- a/lib/unicorn/socket_helper.rb
+++ b/lib/unicorn/socket_helper.rb
@@ -111,8 +111,14 @@ module Unicorn
       sock = if address[0] == ?/
         if File.exist?(address)
           if File.socket?(address)
-            logger.info "unlinking existing socket=#{address}"
-            File.unlink(address)
+            begin
+              UNIXSocket.new(address).close
+              # fall through, try to bind(2) and fail with EADDRINUSE
+              # (or succeed from a small race condition we can't sanely avoid).
+            rescue Errno::ECONNREFUSED
+              logger.info "unlinking existing socket=#{address}"
+              File.unlink(address)
+            end
           else
             raise ArgumentError,
                   "socket=#{address} specified but it is not a socket!"
diff --git a/t/t0011-active-unix-socket.sh b/t/t0011-active-unix-socket.sh
new file mode 100644
index 0000000..6f9ac53
--- /dev/null
+++ b/t/t0011-active-unix-socket.sh
@@ -0,0 +1,79 @@
+#!/bin/sh
+. ./test-lib.sh
+t_plan 11 "existing UNIX domain socket check"
+
+read_pid_unix () {
+	x=$(printf 'GET / HTTP/1.0\r\n\r\n' | \
+	    socat - UNIX:$unix_socket | \
+	    tail -1)
+	test -n "$x"
+	y="$(expr "$x" : '\([0-9]\+\)')"
+	test x"$x" = x"$y"
+	test -n "$y"
+	echo "$y"
+}
+
+t_begin "setup and start" && {
+	rtmpfiles unix_socket unix_config
+	rm -f $unix_socket
+	unicorn_setup
+	grep -v ^listen < $unicorn_config > $unix_config
+	echo "listen '$unix_socket'" >> $unix_config
+	unicorn -D -c $unix_config pid.ru
+	unicorn_wait_start
+	orig_master_pid=$unicorn_pid
+}
+
+t_begin "get pid of worker" && {
+	worker_pid=$(read_pid_unix)
+	t_info "worker_pid=$worker_pid"
+}
+
+t_begin "fails to start with existing pid file" && {
+	rm -f $ok
+	unicorn -D -c $unix_config pid.ru || echo ok > $ok
+	test x"$(cat $ok)" = xok
+}
+
+t_begin "worker pid unchanged" && {
+	test x"$(read_pid_unix)" = x$worker_pid
+	> $r_err
+}
+
+t_begin "fails to start with listening UNIX domain socket bound" && {
+	rm $ok $pid
+	unicorn -D -c $unix_config pid.ru || echo ok > $ok
+	test x"$(cat $ok)" = xok
+	> $r_err
+}
+
+t_begin "worker pid unchanged (again)" && {
+	test x"$(read_pid_unix)" = x$worker_pid
+}
+
+t_begin "nuking the existing Unicorn succeeds" && {
+	kill -9 $unicorn_pid $worker_pid
+	while kill -0 $unicorn_pid
+	do
+		sleep 1
+	done
+	check_stderr
+}
+
+t_begin "succeeds in starting with leftover UNIX domain socket bound" && {
+	test -S $unix_socket
+	unicorn -D -c $unix_config pid.ru
+	unicorn_wait_start
+}
+
+t_begin "worker pid changed" && {
+	test x"$(read_pid_unix)" != x$worker_pid
+}
+
+t_begin "killing succeeds" && {
+	kill $unicorn_pid
+}
+
+t_begin "no errors" && check_stderr
+
+t_done
diff --git a/test/unit/test_socket_helper.rb b/test/unit/test_socket_helper.rb
index bbce359..c6d0d42 100644
--- a/test/unit/test_socket_helper.rb
+++ b/test/unit/test_socket_helper.rb
@@ -101,7 +101,14 @@ class TestSocketHelper < Test::Unit::TestCase
 
   def test_bind_listen_unix_rebind
     test_bind_listen_unix
-    new_listener = bind_listen(@unix_listener_path)
+    new_listener = nil
+    assert_raises(Errno::EADDRINUSE) do
+      new_listener = bind_listen(@unix_listener_path)
+    end
+    assert_nothing_raised do
+      File.unlink(@unix_listener_path)
+      new_listener = bind_listen(@unix_listener_path)
+    end
     assert UNIXServer === new_listener
     assert new_listener.fileno != @unix_listener.fileno
     assert_equal sock_name(new_listener), sock_name(@unix_listener)
-- 
Eric Wong