Hi, I have a problem with one of my Rails servers where the daemonized unicorn_rails master seems to completely ignores the USR2 signal (only). First, about the environment. I''ve searched for solutions to this problem for several day now, and my situation appears to differ from most in that my environment is not using RVM, bundler, Capistrano, or other sandboxing tools. It''s a fairly plain jane Rails deployment. The OS is 64-bit Ubuntu 10.04 server. $ uname -a Linux 2.6.32-37-server #81-Ubuntu SMP Fri Dec 2 20:49:12 UTC 2011 x86_64 GNU/Linux $ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 10.04.4 LTS Release: 10.04 Codename: lucid $ ruby -v ruby 1.9.1p376 (2009-12-07 revision 26041) [x86_64-linux] The Rails gem version is 2.3.14. Unicorn gem version 4.2.0. The Unicorn config includes "preload_app true". We use the same Unicorn configs on similar servers and for some yet unknown reason, this one is seeing a problem with USR2. I have done my best to debug it using strace and the stderr logs. The unicorn_rails master process handles HUP, QUIT, and USR1 signals as expected, no problems. However, the USR2 signal is being completely ignored. I added a before_exec block in the Unicorn conf with some puts statements, but this hook never gets called (unlike the before_fork and after_fork hooks, which are working fine for me). before_exec do |server| $stderr.puts("DEBUG before_exec") $stderr.puts(ENV.inspect) $stderr.puts(Unicorn::HttpServer::START_CTX.inspect) end While monitoring the master with strace, there are no calls observed when sending a kill -USR2. Nothing is logged to STDERR when sending this signal, either. This has me completely stymied with the lack of logging info. Does anyone have any clues that can point me in the right direction? -Jeff
"Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com> wrote:> $ ruby -v > ruby 1.9.1p376 (2009-12-07 revision 26041) [x86_64-linux] > > The Rails gem version is 2.3.14. Unicorn gem version 4.2.0.> I have done my best to debug it using strace and the stderr logs. The > unicorn_rails master process handles HUP, QUIT, and USR1 signals as > expected, no problems. However, the USR2 signal is being completely > ignored.> While monitoring the master with strace, there are no calls observed > when sending a kill -USR2. Nothing is logged to STDERR when sending > this signal, either.Really strange, especially since other signals seem to work... Since you''re on Ruby (>= 1.9.0 && < 1.9.3), are you stracing with -f? If you''re not already, try using: strace -f -e ''!futex'' 1.9.x versions use a dedicated timer thread to accept signals, so you won''t see system signal handlers in the VM without "strace -f". The "-e ''!futex''" filters out the noise from the polling wakeup in <1.9.3 versions of Ruby. Can you try Ruby 1.9.3? The signal handling got completely reworked (for the better) and you won''t need "-e ''!futex''"
Thanks Eric. New strace capture as follows: $ sudo strace -f -e ''!futex'' -p 14255 Process 14255 attached with 12 threads - interrupt to quit [pid 14322] restart_syscall(<... resuming interrupted call ...> <unfinished ...> [pid 14271] restart_syscall(<... resuming interrupted call ...> <unfinished ...> [pid 14267] accept(12, <unfinished ...> [pid 14264] restart_syscall(<... resuming interrupted call ...> <unfinished ...> [pid 14255] select(6, [5], NULL, NULL, {42, 86504} <unfinished ...> [pid 14322] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 14271] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 14264] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 14262] --- SIGUSR2 (User defined signal 2) @ 0 (0) --- [pid 14262] rt_sigreturn(0x20) = 202 The last two lines do not say much for the event. :( -----Original Message----- From: Eric Wong [mailto:normalperson at yhbt.net] Sent: Friday, March 09, 2012 2:24 PM To: unicorn list Cc: Yeung, Jeffrey Subject: Re: Unicorn_rails ignores USR2 signal "Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com> wrote:> $ ruby -v > ruby 1.9.1p376 (2009-12-07 revision 26041) [x86_64-linux] > > The Rails gem version is 2.3.14. Unicorn gem version 4.2.0.> I have done my best to debug it using strace and the stderr logs. The > unicorn_rails master process handles HUP, QUIT, and USR1 signals as > expected, no problems. However, the USR2 signal is being completely > ignored.> While monitoring the master with strace, there are no calls observed > when sending a kill -USR2. Nothing is logged to STDERR when sending > this signal, either.Really strange, especially since other signals seem to work... Since you''re on Ruby (>= 1.9.0 && < 1.9.3), are you stracing with -f? If you''re not already, try using: strace -f -e ''!futex'' 1.9.x versions use a dedicated timer thread to accept signals, so you won''t see system signal handlers in the VM without "strace -f". The "-e ''!futex''" filters out the noise from the polling wakeup in <1.9.3 versions of Ruby. Can you try Ruby 1.9.3? The signal handling got completely reworked (for the better) and you won''t need "-e ''!futex''"
"Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com> wrote:> Thanks Eric. New strace capture as follows: > > $ sudo strace -f -e ''!futex'' -p 14255 > Process 14255 attached with 12 threads - interrupt to quit>From that, you have 5 worker processes?For debugging this, it can cut down on noise to only use one worker process, too. You can check if SIGTTOU works, too :) Also, can you reproduce this on a freshly-started master? Or has the master been running and handling other signals for a while? The most common cause of USR2 failures is due to an executable or library being moved/replaced/upgraded away (sometimes due to Capistrano), but that should definitely get logged and doesn''t seem to be the case for you.> [pid 14322] restart_syscall(<... resuming interrupted call ...> <unfinished ...> > [pid 14271] restart_syscall(<... resuming interrupted call ...> <unfinished ...> > [pid 14267] accept(12, <unfinished ...> > [pid 14264] restart_syscall(<... resuming interrupted call ...> <unfinished ...> > [pid 14255] select(6, [5], NULL, NULL, {42, 86504} <unfinished ...> > [pid 14322] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out) > [pid 14271] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out) > [pid 14264] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out) > [pid 14262] --- SIGUSR2 (User defined signal 2) @ 0 (0) --- > [pid 14262] rt_sigreturn(0x20) = 202 > > The last two lines do not say much for the event. :(Anything more after that? What happens when you send a previously working signal (USR1, HUP) after sending a failed USR2 to that process?
I have it configured for 4 workers. I just turned it down to 1 worker and tested both TTIN and TTOU. They both work, creating and killing workers, respectively. The following is the strace capture from an TTOU signal, followed by a USR2 signal. Yes, that is the only output from the USR2 signal. Behavior doesn''t change if the master is freshly started, or has been running for a while. One other thing I did not mention earlier is that the <pid>.oldbin file never gets created on the USR2 signal, but that''s probably obvious already. [pid 14463] --- SIGTTOU (Stopped (tty output)) @ 0 (0) --- [pid 14463] rt_sigreturn(0x16) = -1 EINTR (Interrupted system call) [pid 14483] rt_sigprocmask(SIG_SETMASK, ~[SEGV VTALRM RTMIN RT_1], NULL, 8) = 0 [pid 14483] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 [pid 14483] tgkill(14456, 14456, SIGVTALRM) = 0 [pid 14456] <... select resumed> ) = ? ERESTARTNOHAND (To be restarted) [pid 14456] --- SIGVTALRM (Virtual timer expired) @ 0 (0) --- [pid 14456] rt_sigreturn(0x1a) = -1 EINTR (Interrupted system call) [pid 14456] fcntl(6, F_GETFL) = 0x801 (flags O_WRONLY|O_NONBLOCK) [pid 14456] write(6, ".", 1) = 1 [pid 14456] select(6, [5], NULL, NULL, {30, 755188}) = 1 (in [5], left {30, 755184}) [pid 14456] fcntl(5, F_GETFL) = 0x800 (flags O_RDONLY|O_NONBLOCK) [pid 14456] read(5, ".", 11) = 1 [pid 14456] wait4(-1, 0x7fffcb5560ec, WNOHANG, NULL) = 0 [pid 14456] wait4(-1, 0x7fffcb5560ec, WNOHANG, NULL) = 0 [pid 14456] kill(14482, SIGQUIT <unfinished ...> [pid 14482] <... select resumed> ) = ? ERESTARTNOHAND (To be restarted) [pid 14456] <... kill resumed> ) = 0 [pid 14482] --- SIGQUIT (Quit) @ 0 (0) --- [pid 14456] select(6, [5], NULL, NULL, {83, 0} <unfinished ...> [pid 14482] rt_sigreturn(0x3) = -1 EINTR (Interrupted system call) [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield() = 0 [pid 14482] sched_yield( <unfinished ...> [pid 14485] rt_sigprocmask(SIG_SETMASK, ~[SEGV VTALRM RTMIN RT_1], NULL, 8) = 0 [pid 14485] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 [pid 14482] <... sched_yield resumed> ) = 0 [pid 14482] close(3) = 0 [pid 14482] select(6, [3], NULL, [4 5], {22, 883208}) = -1 EBADF (Bad file descriptor) [pid 14482] rt_sigprocmask(SIG_SETMASK, ~[HUP INT QUIT KILL USR1 SEGV USR2 TERM CHLD STOP TTIN TTOU VTALRM WINCH RTMIN RT_1], NULL, 8) = 0 [pid 14482] rt_sigprocmask(SIG_SETMASK, ~[HUP INT QUIT KILL USR1 SEGV USR2 TERM CHLD STOP TTIN TTOU VTALRM WINCH RTMIN RT_1], NULL, 8) = 0 [pid 14482] rt_sigprocmask(SIG_SETMASK, ~[HUP INT QUIT KILL USR1 SEGV USR2 TERM CHLD STOP TTIN TTOU VTALRM WINCH RTMIN RT_1], NULL, 8) = 0 [pid 14482] getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 0 [pid 14485] _exit(0) = ? Process 14485 detached [pid 14482] rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER|SA_SIGINFO, 0x7f10515428f0}, {0x48fc00, [], SA_RESTORER|SA_SIGINFO, 0x7f10515428f0}, 8) = 0 [pid 14482] rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER|SA_SIGINFO, 0x7f10515428f0}, {SIG_IGN, [], SA_RESTORER|SA_SIGINFO, 0x7f10515428f0}, 8) = 0 [pid 14482] munmap(0x7f10517b4000, 4096) = 0 [pid 14482] close(4) = 0 [pid 14482] close(5) = 0 [pid 14482] close(7) = 0 [pid 14482] close(12) = 0 [pid 14482] unlink("/tmp/.java_pid14456") = -1 EPERM (Operation not permitted) [pid 14482] exit_group(0) = ? Process 14482 detached [pid 14463] --- SIGCHLD (Child exited) @ 0 (0) --- [pid 14463] rt_sigreturn(0x11) = -1 EINTR (Interrupted system call) [pid 14483] rt_sigprocmask(SIG_SETMASK, ~[SEGV VTALRM RTMIN RT_1], NULL, 8) = 0 [pid 14483] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 [pid 14483] tgkill(14456, 14456, SIGVTALRM) = 0 [pid 14456] <... select resumed> ) = ? ERESTARTNOHAND (To be restarted) [pid 14456] --- SIGVTALRM (Virtual timer expired) @ 0 (0) --- [pid 14456] rt_sigreturn(0x1a) = -1 EINTR (Interrupted system call) [pid 14456] fcntl(6, F_GETFL) = 0x801 (flags O_WRONLY|O_NONBLOCK) [pid 14456] write(6, ".", 1) = 1 [pid 14456] select(6, [5], NULL, NULL, {82, 908414}) = 1 (in [5], left {82, 908410}) [pid 14456] fcntl(5, F_GETFL) = 0x800 (flags O_RDONLY|O_NONBLOCK) [pid 14456] read(5, ".", 11) = 1 [pid 14456] wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 14482 [pid 14456] write(2, "reaped #<Process::Status: pid 14"..., 53) = 53 [pid 14456] wait4(-1, 0x7fffcb5560ec, WNOHANG, NULL) = 0 [pid 14456] select(6, [5], NULL, NULL, {91, 0} <unfinished ...> [pid 14463] --- SIGUSR2 (User defined signal 2) @ 0 (0) --- [pid 14463] rt_sigreturn(0x20) = 202 -----Original Message----- From: Eric Wong [mailto:normalperson at yhbt.net] Sent: Friday, March 09, 2012 4:03 PM To: unicorn list Cc: Yeung, Jeffrey Subject: Re: Unicorn_rails ignores USR2 signal "Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com> wrote:> Thanks Eric. New strace capture as follows: > > $ sudo strace -f -e ''!futex'' -p 14255 Process 14255 attached with 12 > threads - interrupt to quit>From that, you have 5 worker processes?For debugging this, it can cut down on noise to only use one worker process, too. You can check if SIGTTOU works, too :) Also, can you reproduce this on a freshly-started master? Or has the master been running and handling other signals for a while? The most common cause of USR2 failures is due to an executable or library being moved/replaced/upgraded away (sometimes due to Capistrano), but that should definitely get logged and doesn''t seem to be the case for you.> [pid 14322] restart_syscall(<... resuming interrupted call ...> > <unfinished ...> [pid 14271] restart_syscall(<... resuming > interrupted call ...> <unfinished ...> [pid 14267] accept(12, > <unfinished ...> [pid 14264] restart_syscall(<... resuming > interrupted call ...> <unfinished ...> [pid 14255] select(6, [5], > NULL, NULL, {42, 86504} <unfinished ...> [pid 14322] <... > restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid > 14271] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection > timed out) [pid 14264] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 14262] --- SIGUSR2 (User defined signal 2) @ 0 (0) --- > [pid 14262] rt_sigreturn(0x20) = 202 > > The last two lines do not say much for the event. :(Anything more after that? What happens when you send a previously working signal (USR1, HUP) after sending a failed USR2 to that process?
"Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com> wrote:> I have it configured for 4 workers. I just turned it down to 1 worker > and tested both TTIN and TTOU. They both work, creating and killing > workers, respectively. The following is the strace capture from an > TTOU signal, followed by a USR2 signal. Yes, that is the only output > from the USR2 signal. Behavior doesn''t change if the master is > freshly started, or has been running for a while. One other thing I > did not mention earlier is that the <pid>.oldbin file never gets > created on the USR2 signal, but that''s probably obvious already.I also asked for sending another signal after USR2, can you try that? However, there''s another possibility I hadn''t considered, what if you disable preload_app? Your app or some libs it uses may be intercepting USR2 for something it does. Maybe this patch can work, but this may also silently break your application/lib, too... diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb index 7d2c623..1b9d693 100644 --- a/lib/unicorn/http_server.rb +++ b/lib/unicorn/http_server.rb @@ -128,8 +128,7 @@ class Unicorn::HttpServer # setup signal handlers before writing pid file in case people get # trigger happy and send signals as soon as the pid file exists. # Note that signals don''t actually get handled until the #join method - QUEUE_SIGS.each { |sig| trap(sig) { SIG_QUEUE << sig; awaken_master } } - trap(:CHLD) { awaken_master } + master_siginit self.pid = config[:pid] self.master_pid = $$ @@ -689,6 +688,9 @@ class Unicorn::HttpServer Gem.refresh end self.app = app.call + + # override signal handlers the app may have set + master_siginit if preload_app end end @@ -736,4 +738,9 @@ class Unicorn::HttpServer config_listeners.each { |addr| listen(addr) } raise ArgumentError, "no listeners" if LISTENERS.empty? end + + def master_siginit + QUEUE_SIGS.each { |sig| trap(sig) { SIG_QUEUE << sig; awaken_master } } + trap(:CHLD) { awaken_master } + end end
Eric Wong <normalperson at yhbt.net> wrote:> However, there''s another possibility I hadn''t considered, what if you > disable preload_app? Your app or some libs it uses may be intercepting > USR2 for something it does.Ping? Was this it?
Hi Eric, Sorry for the delay. It looks like disabling preload_app did the trick. A new master was created after sending the USR2. Now the $$$ question is, what in the world is intercepting the signal? :S And this is a little late now, but here''s the strace of a USR2 signal followed by a TTOU signal, with preload_app true (latter signal working okay in this case): [pid 14542] --- SIGUSR2 (User defined signal 2) @ 0 (0) --- [pid 14542] rt_sigreturn(0x20) = 202 [pid 14535] <... select resumed> ) = 0 (Timeout) [pid 14535] wait4(-1, 0x7fff6bcccb9c, WNOHANG, NULL) = 0 [pid 14535] select(6, [5], NULL, NULL, {60, 0} <unfinished ...> [pid 14542] --- SIGTTOU (Stopped (tty output)) @ 0 (0) --- [pid 14542] rt_sigreturn(0x16) = -1 EINTR (Interrupted system call) [pid 14553] rt_sigprocmask(SIG_SETMASK, ~[SEGV VTALRM RTMIN RT_1], NULL, 8) = 0 [pid 14553] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 [pid 14553] tgkill(14535, 14535, SIGVTALRM) = 0 [pid 14535] <... select resumed> ) = ? ERESTARTNOHAND (To be restarted) [pid 14535] --- SIGVTALRM (Virtual timer expired) @ 0 (0) --- [pid 14535] rt_sigreturn(0x1a) = -1 EINTR (Interrupted system call) [pid 14535] fcntl(6, F_GETFL) = 0x1 (flags O_WRONLY) [pid 14535] fcntl(6, F_SETFL, O_WRONLY|O_NONBLOCK) = 0 [pid 14535] write(6, ".", 1) = 1 [pid 14535] select(6, [5], NULL, NULL, {59, 939700}) = 1 (in [5], left {59, 939696}) [pid 14535] fcntl(5, F_GETFL) = 0 (flags O_RDONLY) [pid 14535] fcntl(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 [pid 14535] read(5, ".", 11) = 1 [pid 14535] wait4(-1, 0x7fff6bcccb9c, WNOHANG, NULL) = 0 [pid 14535] wait4(-1, 0x7fff6bcccb9c, WNOHANG, NULL) = 0 [pid 14535] kill(14552, SIGQUIT) = 0 [pid 14535] select(6, [5], NULL, NULL, {60, 0} <unfinished ...> [pid 14542] --- SIGCHLD (Child exited) @ 0 (0) --- [pid 14542] rt_sigreturn(0x11) = -1 EINTR (Interrupted system call) [pid 14553] rt_sigprocmask(SIG_SETMASK, ~[SEGV VTALRM RTMIN RT_1], NULL, 8) = 0 [pid 14553] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 [pid 14553] tgkill(14535, 14535, SIGVTALRM) = 0 [pid 14535] <... select resumed> ) = ? ERESTARTNOHAND (To be restarted) [pid 14535] --- SIGVTALRM (Virtual timer expired) @ 0 (0) --- [pid 14535] rt_sigreturn(0x1a) = -1 EINTR (Interrupted system call) [pid 14535] fcntl(6, F_GETFL) = 0x801 (flags O_WRONLY|O_NONBLOCK) [pid 14535] write(6, ".", 1) = 1 [pid 14535] select(6, [5], NULL, NULL, {59, 880004}) = 1 (in [5], left {59, 880001}) [pid 14535] fcntl(5, F_GETFL) = 0x800 (flags O_RDONLY|O_NONBLOCK) [pid 14535] read(5, ".", 11) = 1 [pid 14535] wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 14552 [pid 14535] write(2, "reaped #<Process::Status: pid 14"..., 53) = 53 [pid 14535] wait4(-1, 0x7fff6bcccb9c, WNOHANG, NULL) = -1 ECHILD (No child processes) [pid 14535] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 [pid 14535] select(6, [5], NULL, NULL, {119, 0} -----Original Message----- From: Eric Wong [mailto:normalperson at yhbt.net] Sent: Monday, March 12, 2012 2:21 PM To: unicorn list Cc: Yeung, Jeffrey Subject: Re: Unicorn_rails ignores USR2 signal Eric Wong <normalperson at yhbt.net> wrote:> However, there''s another possibility I hadn''t considered, what if you > disable preload_app? Your app or some libs it uses may be > intercepting > USR2 for something it does.Ping? Was this it?
"Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com> wrote:> Sorry for the delay. It looks like disabling preload_app did the > trick. A new master was created after sending the USR2. Now the $$$ > question is, what in the world is intercepting the signal? :SGood to know, I''d just grep the installation directories for all your Ruby libs + gems for USR2. I haven''t seen this problem before, but it''d be good to document the conflict, at least.
Eric Wong <normalperson at yhbt.net> wrote:> "Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com> wrote: > > Sorry for the delay. It looks like disabling preload_app did the > > trick. A new master was created after sending the USR2. Now the $$$ > > question is, what in the world is intercepting the signal? :S > > Good to know, I''d just grep the installation directories for all your > Ruby libs + gems for USR2. I haven''t seen this problem before, but it''d > be good to document the conflict, at least.Btw, did you ever figure out what was causing the conflict? Pushing this out to git://bogomips.org/unicorn.git>From 1e13ffee3469997286e65e0563b6433e7744388a Mon Sep 17 00:00:00 2001From: Eric Wong <normalperson at yhbt.net> Date: Tue, 20 Mar 2012 19:51:35 +0000 Subject: [PATCH] KNOWN_ISSUES: document signal conflicts in libs/apps Jeffrey Yeung confirmed this issue on the mailing list. ref: <E8D9E7CCC2621343A0A3BB45E8DEDFA91C682DD23D at CRPMBOXPRD04.polycom.com> --- KNOWN_ISSUES | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/KNOWN_ISSUES b/KNOWN_ISSUES index f323c68..38263e7 100644 --- a/KNOWN_ISSUES +++ b/KNOWN_ISSUES @@ -3,6 +3,11 @@ Occasionally odd {issues}[link:ISSUES.html] arise without a transparent or acceptable solution. Those issues are documented here. +* Some libraries/applications may install signal handlers which conflict + with signal handlers unicorn uses. Leaving "preload_app false" + (the default) will allow unicorn to always override existing signal + handlers. + * Issues with FreeBSD jails can be worked around as documented by Tatsuya Ono: http://mid.gmane.org/CAHBuKRj09FdxAgzsefJWotexw-7JYZGJMtgUp_dhjPz9VbKD6Q at mail.gmail.com
Eric, I have been unable to narrow down the cause of the conflict so far. The list of Ruby gems (and gem versions) on the affected deployment are identical to the ones on another deployment where Unicorn is upgrading just fine (with preloaded app). Grep''ing for USR2 in the gem installations did not reveal anything, unfortunately. Since then, I haven''t been able to spend further time investigating. Not sure where else to look, really, but I''m open to further suggestions. -Jeff -----Original Message----- From: Eric Wong [mailto:normalperson at yhbt.net] Sent: Tuesday, March 20, 2012 12:58 PM To: unicorn list Cc: Yeung, Jeffrey Subject: Re: Unicorn_rails ignores USR2 signal Eric Wong <normalperson at yhbt.net> wrote:> "Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com> wrote: > > Sorry for the delay. It looks like disabling preload_app did the > > trick. A new master was created after sending the USR2. Now the > > $$$ question is, what in the world is intercepting the signal? :S > > Good to know, I''d just grep the installation directories for all your > Ruby libs + gems for USR2. I haven''t seen this problem before, but > it''d be good to document the conflict, at least.Btw, did you ever figure out what was causing the conflict? Pushing this out to git://bogomips.org/unicorn.git>From 1e13ffee3469997286e65e0563b6433e7744388a Mon Sep 17 00:00:00 2001From: Eric Wong <normalperson at yhbt.net> Date: Tue, 20 Mar 2012 19:51:35 +0000 Subject: [PATCH] KNOWN_ISSUES: document signal conflicts in libs/apps Jeffrey Yeung confirmed this issue on the mailing list. ref: <E8D9E7CCC2621343A0A3BB45E8DEDFA91C682DD23D at CRPMBOXPRD04.polycom.com> --- KNOWN_ISSUES | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/KNOWN_ISSUES b/KNOWN_ISSUES index f323c68..38263e7 100644 --- a/KNOWN_ISSUES +++ b/KNOWN_ISSUES @@ -3,6 +3,11 @@ Occasionally odd {issues}[link:ISSUES.html] arise without a transparent or acceptable solution. Those issues are documented here. +* Some libraries/applications may install signal handlers which +conflict + with signal handlers unicorn uses. Leaving "preload_app false" + (the default) will allow unicorn to always override existing signal + handlers. + * Issues with FreeBSD jails can be worked around as documented by Tatsuya Ono: http://mid.gmane.org/CAHBuKRj09FdxAgzsefJWotexw-7JYZGJMtgUp_dhjPz9VbKD6Q at mail.gmail.com
On 3/20/12 4:09 PM, Yeung, Jeffrey wrote:> I have been unable to narrow down the cause of the conflict so far. The list of Ruby gems (and gem versions) on the affected deployment are identical to the ones on another deployment where Unicorn is upgrading just fine (with preloaded app). Grep''ing for USR2 in the gem installations did not reveal anything, unfortunately. Since then, I haven''t been able to spend further time investigating. Not sure where else to look, really, but I''m open to further suggestions.Jeffery, To uncover the culprit, you might try monkey-patching Kernel.trap and Signal.trap so it logs the last few entries in caller when it''s called with USR2. Put something like this really early in you app bootstrap: [Kernel,Signal].each |klass| class << klass alias :orig_trap :trap def trap *args, &block if args.first.to_s =~ /USR2$/i || args.first.to_i == 31 $stderr.puts "Caught someone trapping USR2 caller is:", caller.last(2) end orig_trap *args, &block end end end
I have little to report, unfortunately. I still have not identified what is trapping USR2 in my application, if something is indeed doing that. I''ve attempted to do a bit more investigation work in my spare time this week, and the results are still rather mysterious. First, I applied your patch to http_server.rb that you suggested in this thread back on March 9th (quoted below), which re-overrides the signal handlers in build_app!(). The patch did work. Okay, that seems to reconfirm that the app is doing something with the USR2 signal handler... Then I modified the master_siginit() method a bit further to provide more info: def master_siginit QUEUE_SIGS.each { |sig| old_handler = trap(sig) { SIG_QUEUE << sig; awaken_master } logger.info("old handler for #{sig.to_s}: #{old_handler.inspect}") } trap(:CHLD) { awaken_master } end And the output was a bit surprising. Here is the Unicorn stderr log when starting up the service. master_siginit() is called twice, as expected... I, [2012-03-30T14:57:51.301839 #15828] INFO -- : listening on addr=0.0.0.0:8080 fd=3 I, [2012-03-30T14:57:51.302341 #15828] INFO -- : old handler for WINCH: nil I, [2012-03-30T14:57:51.302435 #15828] INFO -- : old handler for QUIT: "DEFAULT" I, [2012-03-30T14:57:51.302505 #15828] INFO -- : old handler for INT: "DEFAULT" I, [2012-03-30T14:57:51.302573 #15828] INFO -- : old handler for TERM: "DEFAULT" I, [2012-03-30T14:57:51.302641 #15828] INFO -- : old handler for USR1: "DEFAULT" I, [2012-03-30T14:57:51.302709 #15828] INFO -- : old handler for USR2: "DEFAULT" I, [2012-03-30T14:57:51.302786 #15828] INFO -- : old handler for HUP: "DEFAULT" I, [2012-03-30T14:57:51.302853 #15828] INFO -- : old handler for TTIN: nil I, [2012-03-30T14:57:51.302917 #15828] INFO -- : old handler for TTOU: nil I, [2012-03-30T14:57:51.303502 #15828] INFO -- : Refreshing Gem list old handler for WINCH: #<Proc:0x00000002d97488@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745> old handler for QUIT: #<Proc:0x00000002d97098@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745> old handler for INT: #<Proc:0x00000002d96c70@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745> old handler for TERM: #<Proc:0x00000002d96848@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745> old handler for USR1: #<Proc:0x00000002d96420@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745> old handler for USR2: #<Proc:0x00000002d95ff8@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745> old handler for HUP: #<Proc:0x00000002d95bd0@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745> old handler for TTIN: #<Proc:0x00000002d957a8@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745> old handler for TTOU: #<Proc:0x00000002d953b8@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745> master process ready worker=0 ready Which to me, looks like the USR2 handler was unchanged in the second master_siginit() call. Yet, if I remove that second master_siginit() call from there, USR2 ceases to work. :S -Jeff -----Original Message----- diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb index 7d2c623..1b9d693 100644 --- a/lib/unicorn/http_server.rb +++ b/lib/unicorn/http_server.rb @@ -128,8 +128,7 @@ class Unicorn::HttpServer # setup signal handlers before writing pid file in case people get # trigger happy and send signals as soon as the pid file exists. # Note that signals don''t actually get handled until the #join method - QUEUE_SIGS.each { |sig| trap(sig) { SIG_QUEUE << sig; awaken_master } } - trap(:CHLD) { awaken_master } + master_siginit self.pid = config[:pid] self.master_pid = $$ @@ -689,6 +688,9 @@ class Unicorn::HttpServer Gem.refresh end self.app = app.call + + # override signal handlers the app may have set + master_siginit if preload_app end end @@ -736,4 +738,9 @@ class Unicorn::HttpServer config_listeners.each { |addr| listen(addr) } raise ArgumentError, "no listeners" if LISTENERS.empty? end + + def master_siginit + QUEUE_SIGS.each { |sig| trap(sig) { SIG_QUEUE << sig; awaken_master } } + trap(:CHLD) { awaken_master } + end end
Hey Jeffrey, we had a similar problem about 8 months ago. What version of ruby (including patch number) are you using? I think you said you''re not using anything special in your deployment environment, right? You''re not using god, by chance to do the restarts, are you? Also, how are you doing your deployments, if not with cap? -- Alex Sharp Zaarly, Inc | @ajsharp | github.com/ajsharp | alexjsharp.com
We''ve run into the exact same problem. We spent a few good solid hours trying to fix the problem with no avail (but we do have a ''solution''). Server specs: Ubuntu 12.04.2 LTS (GNU/Linux 3.5.0-27-generic x86_64) Rails & Ruby: Rails 4.0.0.beta1 Ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-linux] Unicorn 4.6.2 - preload_app true Capistrano The first thing we tried was removing all of the gems (foreman, sidekiq, etc.) and that didn''t work. We tried preload_app false, and that worked (but this is not the situation we want). We also tried searching for any gems that were capturing USR2 (or any signals in general). We tried monkey patching Kernel & Trap to see if we could pinpoint the problem, but that didn''t seem to turn up much. We then jumped in unicorn''s http_server.rb and threw in some logging. The only thing that seemed to show us was that USR2 was unable to get through to unicorn. Signals HUP, QUIT, etc. all worked fine (and made their way through as per norm). We then fired up strace with unicorn to see if we could ''see'' the USR2 signal. It looks almost identical to Jeffery''s strace. It looks like the USR2 signal is received by the process, but then dropped (or ignored). If we pass it a QUIT signal it works just fine and we can see that in the strace. So here''s the chain of events that we think is occuring: 1. We start unicorn (a unicorn blessing) 2. before_fork is triggered (Signal trapping works here) 3. A unicorn is born (Signal trapping does not work here) 4. The after fork is triggered In our situation. Once a unicorn is born, it can no longer received a USR2 (for whatever reason). We never actually checked to see if the after fork received a USR2. After we managed to figure out the path of execution, we figured out a quick fix. We actually call reexec method directly on the unicorn server in the before_fork block in unicorn.rb. Here''s our unicorn.rb https://gist.github.com/RyanonRails/5541902 ----- snip ----- before_fork do |server, worker| Signal.trap ''USR2'' do puts ''Since a black hole is lurking and eating USR2s we will hit http_server#reexec ourselves'' server.send(:reexec) end end ----- snip ----- This way we can actually force the USR2 directly on the server (since USR2 is available inside of the before_fork). This seems to be working well for us now. We''re still frustrated that we weren''t able to find the cause of the problem, but we least we can move forward (with minor disruption to the code/code base). Lastly, here are our current theories as to why this could be occuring: 1. When unicorn is building it''s native extensions something weird is happening in regards to the USR2 signal 2. A gem is somehow gobbling up the signal (we were quite thorough, but we might''ve missed something) Hopefully this helps! Thanks, Ryan & Mike _______________________________________________ Unicorn mailing list - mongrel-unicorn@rubyforge.org http://rubyforge.org/mailman/listinfo/mongrel-unicorn Do not quote signatures (like this one) or top post when replying