Hi,
I have a problem with one of my Rails servers where the daemonized unicorn_rails
master seems to completely ignores the USR2 signal (only).
First, about the environment. I''ve searched for solutions to this
problem for several day now, and my situation appears to differ from most in
that my environment is not using RVM, bundler, Capistrano, or other sandboxing
tools. It''s a fairly plain jane Rails deployment. The OS is 64-bit
Ubuntu 10.04 server.
$ uname -a
Linux 2.6.32-37-server #81-Ubuntu SMP Fri Dec 2 20:49:12 UTC 2011 x86_64
GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 10.04.4 LTS
Release: 10.04
Codename: lucid
$ ruby -v
ruby 1.9.1p376 (2009-12-07 revision 26041) [x86_64-linux]
The Rails gem version is 2.3.14. Unicorn gem version 4.2.0.
The Unicorn config includes "preload_app true". We use the same
Unicorn configs on similar servers and for some yet unknown reason, this one is
seeing a problem with USR2.
I have done my best to debug it using strace and the stderr logs. The
unicorn_rails master process handles HUP, QUIT, and USR1 signals as expected, no
problems. However, the USR2 signal is being completely ignored.
I added a before_exec block in the Unicorn conf with some puts statements, but
this hook never gets called (unlike the before_fork and after_fork hooks, which
are working fine for me).
before_exec do |server|
$stderr.puts("DEBUG before_exec")
$stderr.puts(ENV.inspect)
$stderr.puts(Unicorn::HttpServer::START_CTX.inspect)
end
While monitoring the master with strace, there are no calls observed when
sending a kill -USR2. Nothing is logged to STDERR when sending this signal,
either.
This has me completely stymied with the lack of logging info. Does anyone have
any clues that can point me in the right direction?
-Jeff
"Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com> wrote:> $ ruby -v > ruby 1.9.1p376 (2009-12-07 revision 26041) [x86_64-linux] > > The Rails gem version is 2.3.14. Unicorn gem version 4.2.0.> I have done my best to debug it using strace and the stderr logs. The > unicorn_rails master process handles HUP, QUIT, and USR1 signals as > expected, no problems. However, the USR2 signal is being completely > ignored.> While monitoring the master with strace, there are no calls observed > when sending a kill -USR2. Nothing is logged to STDERR when sending > this signal, either.Really strange, especially since other signals seem to work... Since you''re on Ruby (>= 1.9.0 && < 1.9.3), are you stracing with -f? If you''re not already, try using: strace -f -e ''!futex'' 1.9.x versions use a dedicated timer thread to accept signals, so you won''t see system signal handlers in the VM without "strace -f". The "-e ''!futex''" filters out the noise from the polling wakeup in <1.9.3 versions of Ruby. Can you try Ruby 1.9.3? The signal handling got completely reworked (for the better) and you won''t need "-e ''!futex''"
Thanks Eric. New strace capture as follows:
$ sudo strace -f -e ''!futex'' -p 14255
Process 14255 attached with 12 threads - interrupt to quit
[pid 14322] restart_syscall(<... resuming interrupted call ...>
<unfinished ...>
[pid 14271] restart_syscall(<... resuming interrupted call ...>
<unfinished ...>
[pid 14267] accept(12, <unfinished ...>
[pid 14264] restart_syscall(<... resuming interrupted call ...>
<unfinished ...>
[pid 14255] select(6, [5], NULL, NULL, {42, 86504} <unfinished ...>
[pid 14322] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection
timed out)
[pid 14271] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection
timed out)
[pid 14264] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection
timed out)
[pid 14262] --- SIGUSR2 (User defined signal 2) @ 0 (0) ---
[pid 14262] rt_sigreturn(0x20) = 202
The last two lines do not say much for the event. :(
-----Original Message-----
From: Eric Wong [mailto:normalperson at yhbt.net]
Sent: Friday, March 09, 2012 2:24 PM
To: unicorn list
Cc: Yeung, Jeffrey
Subject: Re: Unicorn_rails ignores USR2 signal
"Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com>
wrote:> $ ruby -v
> ruby 1.9.1p376 (2009-12-07 revision 26041) [x86_64-linux]
>
> The Rails gem version is 2.3.14. Unicorn gem version 4.2.0.
> I have done my best to debug it using strace and the stderr logs. The
> unicorn_rails master process handles HUP, QUIT, and USR1 signals as
> expected, no problems. However, the USR2 signal is being completely
> ignored.
> While monitoring the master with strace, there are no calls observed
> when sending a kill -USR2. Nothing is logged to STDERR when sending
> this signal, either.
Really strange, especially since other signals seem to work...
Since you''re on Ruby (>= 1.9.0 && < 1.9.3), are you
stracing with -f?
If you''re not already, try using: strace -f -e
''!futex''
1.9.x versions use a dedicated timer thread to accept signals, so you
won''t see system signal handlers in the VM without "strace
-f".
The "-e ''!futex''" filters out the noise from the
polling wakeup in
<1.9.3 versions of Ruby.
Can you try Ruby 1.9.3? The signal handling got completely reworked (for the
better) and you won''t need "-e ''!futex''"
"Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com> wrote:> Thanks Eric. New strace capture as follows: > > $ sudo strace -f -e ''!futex'' -p 14255 > Process 14255 attached with 12 threads - interrupt to quit>From that, you have 5 worker processes?For debugging this, it can cut down on noise to only use one worker process, too. You can check if SIGTTOU works, too :) Also, can you reproduce this on a freshly-started master? Or has the master been running and handling other signals for a while? The most common cause of USR2 failures is due to an executable or library being moved/replaced/upgraded away (sometimes due to Capistrano), but that should definitely get logged and doesn''t seem to be the case for you.> [pid 14322] restart_syscall(<... resuming interrupted call ...> <unfinished ...> > [pid 14271] restart_syscall(<... resuming interrupted call ...> <unfinished ...> > [pid 14267] accept(12, <unfinished ...> > [pid 14264] restart_syscall(<... resuming interrupted call ...> <unfinished ...> > [pid 14255] select(6, [5], NULL, NULL, {42, 86504} <unfinished ...> > [pid 14322] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out) > [pid 14271] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out) > [pid 14264] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out) > [pid 14262] --- SIGUSR2 (User defined signal 2) @ 0 (0) --- > [pid 14262] rt_sigreturn(0x20) = 202 > > The last two lines do not say much for the event. :(Anything more after that? What happens when you send a previously working signal (USR1, HUP) after sending a failed USR2 to that process?
I have it configured for 4 workers. I just turned it down to 1 worker and
tested both TTIN and TTOU. They both work, creating and killing workers,
respectively. The following is the strace capture from an TTOU signal, followed
by a USR2 signal. Yes, that is the only output from the USR2 signal. Behavior
doesn''t change if the master is freshly started, or has been running
for a while. One other thing I did not mention earlier is that the
<pid>.oldbin file never gets created on the USR2 signal, but
that''s probably obvious already.
[pid 14463] --- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
[pid 14463] rt_sigreturn(0x16) = -1 EINTR (Interrupted system call)
[pid 14483] rt_sigprocmask(SIG_SETMASK, ~[SEGV VTALRM RTMIN RT_1], NULL, 8) = 0
[pid 14483] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 14483] tgkill(14456, 14456, SIGVTALRM) = 0
[pid 14456] <... select resumed> ) = ? ERESTARTNOHAND (To be
restarted)
[pid 14456] --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
[pid 14456] rt_sigreturn(0x1a) = -1 EINTR (Interrupted system call)
[pid 14456] fcntl(6, F_GETFL) = 0x801 (flags O_WRONLY|O_NONBLOCK)
[pid 14456] write(6, ".", 1) = 1
[pid 14456] select(6, [5], NULL, NULL, {30, 755188}) = 1 (in [5], left {30,
755184})
[pid 14456] fcntl(5, F_GETFL) = 0x800 (flags O_RDONLY|O_NONBLOCK)
[pid 14456] read(5, ".", 11) = 1
[pid 14456] wait4(-1, 0x7fffcb5560ec, WNOHANG, NULL) = 0
[pid 14456] wait4(-1, 0x7fffcb5560ec, WNOHANG, NULL) = 0
[pid 14456] kill(14482, SIGQUIT <unfinished ...>
[pid 14482] <... select resumed> ) = ? ERESTARTNOHAND (To be
restarted)
[pid 14456] <... kill resumed> ) = 0
[pid 14482] --- SIGQUIT (Quit) @ 0 (0) ---
[pid 14456] select(6, [5], NULL, NULL, {83, 0} <unfinished ...>
[pid 14482] rt_sigreturn(0x3) = -1 EINTR (Interrupted system call)
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield() = 0
[pid 14482] sched_yield( <unfinished ...>
[pid 14485] rt_sigprocmask(SIG_SETMASK, ~[SEGV VTALRM RTMIN RT_1], NULL, 8) = 0
[pid 14485] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 14482] <... sched_yield resumed> ) = 0
[pid 14482] close(3) = 0
[pid 14482] select(6, [3], NULL, [4 5], {22, 883208}) = -1 EBADF (Bad file
descriptor)
[pid 14482] rt_sigprocmask(SIG_SETMASK, ~[HUP INT QUIT KILL USR1 SEGV USR2 TERM
CHLD STOP TTIN TTOU VTALRM WINCH RTMIN RT_1], NULL, 8) = 0
[pid 14482] rt_sigprocmask(SIG_SETMASK, ~[HUP INT QUIT KILL USR1 SEGV USR2 TERM
CHLD STOP TTIN TTOU VTALRM WINCH RTMIN RT_1], NULL, 8) = 0
[pid 14482] rt_sigprocmask(SIG_SETMASK, ~[HUP INT QUIT KILL USR1 SEGV USR2 TERM
CHLD STOP TTIN TTOU VTALRM WINCH RTMIN RT_1], NULL, 8) = 0
[pid 14482] getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024,
rlim_max=RLIM_INFINITY}) = 0
[pid 14485] _exit(0) = ?
Process 14485 detached
[pid 14482] rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER|SA_SIGINFO,
0x7f10515428f0}, {0x48fc00, [], SA_RESTORER|SA_SIGINFO, 0x7f10515428f0}, 8) = 0
[pid 14482] rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER|SA_SIGINFO,
0x7f10515428f0}, {SIG_IGN, [], SA_RESTORER|SA_SIGINFO, 0x7f10515428f0}, 8) = 0
[pid 14482] munmap(0x7f10517b4000, 4096) = 0
[pid 14482] close(4) = 0
[pid 14482] close(5) = 0
[pid 14482] close(7) = 0
[pid 14482] close(12) = 0
[pid 14482] unlink("/tmp/.java_pid14456") = -1 EPERM (Operation not
permitted)
[pid 14482] exit_group(0) = ?
Process 14482 detached
[pid 14463] --- SIGCHLD (Child exited) @ 0 (0) ---
[pid 14463] rt_sigreturn(0x11) = -1 EINTR (Interrupted system call)
[pid 14483] rt_sigprocmask(SIG_SETMASK, ~[SEGV VTALRM RTMIN RT_1], NULL, 8) = 0
[pid 14483] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 14483] tgkill(14456, 14456, SIGVTALRM) = 0
[pid 14456] <... select resumed> ) = ? ERESTARTNOHAND (To be
restarted)
[pid 14456] --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
[pid 14456] rt_sigreturn(0x1a) = -1 EINTR (Interrupted system call)
[pid 14456] fcntl(6, F_GETFL) = 0x801 (flags O_WRONLY|O_NONBLOCK)
[pid 14456] write(6, ".", 1) = 1
[pid 14456] select(6, [5], NULL, NULL, {82, 908414}) = 1 (in [5], left {82,
908410})
[pid 14456] fcntl(5, F_GETFL) = 0x800 (flags O_RDONLY|O_NONBLOCK)
[pid 14456] read(5, ".", 11) = 1
[pid 14456] wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG,
NULL) = 14482
[pid 14456] write(2, "reaped #<Process::Status: pid 14"..., 53) =
53
[pid 14456] wait4(-1, 0x7fffcb5560ec, WNOHANG, NULL) = 0
[pid 14456] select(6, [5], NULL, NULL, {91, 0}
<unfinished ...>
[pid 14463] --- SIGUSR2 (User defined signal 2) @ 0 (0) ---
[pid 14463] rt_sigreturn(0x20) = 202
-----Original Message-----
From: Eric Wong [mailto:normalperson at yhbt.net]
Sent: Friday, March 09, 2012 4:03 PM
To: unicorn list
Cc: Yeung, Jeffrey
Subject: Re: Unicorn_rails ignores USR2 signal
"Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com>
wrote:> Thanks Eric. New strace capture as follows:
>
> $ sudo strace -f -e ''!futex'' -p 14255 Process 14255
attached with 12
> threads - interrupt to quit
>From that, you have 5 worker processes?
For debugging this, it can cut down on noise to only use one worker process,
too. You can check if SIGTTOU works, too :)
Also, can you reproduce this on a freshly-started master? Or has the master
been running and handling other signals for a while?
The most common cause of USR2 failures is due to an executable or library being
moved/replaced/upgraded away (sometimes due to Capistrano), but that should
definitely get logged and doesn''t seem to be the case for you.
> [pid 14322] restart_syscall(<... resuming interrupted call ...>
> <unfinished ...> [pid 14271] restart_syscall(<... resuming
> interrupted call ...> <unfinished ...> [pid 14267] accept(12,
> <unfinished ...> [pid 14264] restart_syscall(<... resuming
> interrupted call ...> <unfinished ...> [pid 14255] select(6, [5],
> NULL, NULL, {42, 86504} <unfinished ...> [pid 14322] <...
> restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid
> 14271] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection
> timed out) [pid 14264] <... restart_syscall resumed> ) = -1
ETIMEDOUT (Connection timed out) [pid 14262] --- SIGUSR2 (User defined signal
2) @ 0 (0) ---
> [pid 14262] rt_sigreturn(0x20) = 202
>
> The last two lines do not say much for the event. :(
Anything more after that? What happens when you send a previously working
signal (USR1, HUP) after sending a failed USR2 to that process?
"Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com> wrote:> I have it configured for 4 workers. I just turned it down to 1 worker > and tested both TTIN and TTOU. They both work, creating and killing > workers, respectively. The following is the strace capture from an > TTOU signal, followed by a USR2 signal. Yes, that is the only output > from the USR2 signal. Behavior doesn''t change if the master is > freshly started, or has been running for a while. One other thing I > did not mention earlier is that the <pid>.oldbin file never gets > created on the USR2 signal, but that''s probably obvious already.I also asked for sending another signal after USR2, can you try that? However, there''s another possibility I hadn''t considered, what if you disable preload_app? Your app or some libs it uses may be intercepting USR2 for something it does. Maybe this patch can work, but this may also silently break your application/lib, too... diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb index 7d2c623..1b9d693 100644 --- a/lib/unicorn/http_server.rb +++ b/lib/unicorn/http_server.rb @@ -128,8 +128,7 @@ class Unicorn::HttpServer # setup signal handlers before writing pid file in case people get # trigger happy and send signals as soon as the pid file exists. # Note that signals don''t actually get handled until the #join method - QUEUE_SIGS.each { |sig| trap(sig) { SIG_QUEUE << sig; awaken_master } } - trap(:CHLD) { awaken_master } + master_siginit self.pid = config[:pid] self.master_pid = $$ @@ -689,6 +688,9 @@ class Unicorn::HttpServer Gem.refresh end self.app = app.call + + # override signal handlers the app may have set + master_siginit if preload_app end end @@ -736,4 +738,9 @@ class Unicorn::HttpServer config_listeners.each { |addr| listen(addr) } raise ArgumentError, "no listeners" if LISTENERS.empty? end + + def master_siginit + QUEUE_SIGS.each { |sig| trap(sig) { SIG_QUEUE << sig; awaken_master } } + trap(:CHLD) { awaken_master } + end end
Eric Wong <normalperson at yhbt.net> wrote:> However, there''s another possibility I hadn''t considered, what if you > disable preload_app? Your app or some libs it uses may be intercepting > USR2 for something it does.Ping? Was this it?
Hi Eric,
Sorry for the delay. It looks like disabling preload_app did the trick. A new
master was created after sending the USR2. Now the $$$ question is, what in the
world is intercepting the signal? :S
And this is a little late now, but here''s the strace of a USR2 signal
followed by a TTOU signal, with preload_app true (latter signal working okay in
this case):
[pid 14542] --- SIGUSR2 (User defined signal 2) @ 0 (0) ---
[pid 14542] rt_sigreturn(0x20) = 202
[pid 14535] <... select resumed> ) = 0 (Timeout)
[pid 14535] wait4(-1, 0x7fff6bcccb9c, WNOHANG, NULL) = 0
[pid 14535] select(6, [5], NULL, NULL, {60, 0} <unfinished ...>
[pid 14542] --- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
[pid 14542] rt_sigreturn(0x16) = -1 EINTR (Interrupted system call)
[pid 14553] rt_sigprocmask(SIG_SETMASK, ~[SEGV VTALRM RTMIN RT_1], NULL, 8) = 0
[pid 14553] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 14553] tgkill(14535, 14535, SIGVTALRM) = 0
[pid 14535] <... select resumed> ) = ? ERESTARTNOHAND (To be
restarted)
[pid 14535] --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
[pid 14535] rt_sigreturn(0x1a) = -1 EINTR (Interrupted system call)
[pid 14535] fcntl(6, F_GETFL) = 0x1 (flags O_WRONLY)
[pid 14535] fcntl(6, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
[pid 14535] write(6, ".", 1) = 1
[pid 14535] select(6, [5], NULL, NULL, {59, 939700}) = 1 (in [5], left {59,
939696})
[pid 14535] fcntl(5, F_GETFL) = 0 (flags O_RDONLY)
[pid 14535] fcntl(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
[pid 14535] read(5, ".", 11) = 1
[pid 14535] wait4(-1, 0x7fff6bcccb9c, WNOHANG, NULL) = 0
[pid 14535] wait4(-1, 0x7fff6bcccb9c, WNOHANG, NULL) = 0
[pid 14535] kill(14552, SIGQUIT) = 0
[pid 14535] select(6, [5], NULL, NULL, {60, 0} <unfinished ...>
[pid 14542] --- SIGCHLD (Child exited) @ 0 (0) ---
[pid 14542] rt_sigreturn(0x11) = -1 EINTR (Interrupted system call)
[pid 14553] rt_sigprocmask(SIG_SETMASK, ~[SEGV VTALRM RTMIN RT_1], NULL, 8) = 0
[pid 14553] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 14553] tgkill(14535, 14535, SIGVTALRM) = 0
[pid 14535] <... select resumed> ) = ? ERESTARTNOHAND (To be
restarted)
[pid 14535] --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
[pid 14535] rt_sigreturn(0x1a) = -1 EINTR (Interrupted system call)
[pid 14535] fcntl(6, F_GETFL) = 0x801 (flags O_WRONLY|O_NONBLOCK)
[pid 14535] write(6, ".", 1) = 1
[pid 14535] select(6, [5], NULL, NULL, {59, 880004}) = 1 (in [5], left {59,
880001})
[pid 14535] fcntl(5, F_GETFL) = 0x800 (flags O_RDONLY|O_NONBLOCK)
[pid 14535] read(5, ".", 11) = 1
[pid 14535] wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG,
NULL) = 14552
[pid 14535] write(2, "reaped #<Process::Status: pid 14"..., 53) =
53
[pid 14535] wait4(-1, 0x7fff6bcccb9c, WNOHANG, NULL) = -1 ECHILD (No child
processes)
[pid 14535] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 14535] select(6, [5], NULL, NULL, {119, 0}
-----Original Message-----
From: Eric Wong [mailto:normalperson at yhbt.net]
Sent: Monday, March 12, 2012 2:21 PM
To: unicorn list
Cc: Yeung, Jeffrey
Subject: Re: Unicorn_rails ignores USR2 signal
Eric Wong <normalperson at yhbt.net> wrote:> However, there''s another possibility I hadn''t considered,
what if you
> disable preload_app? Your app or some libs it uses may be
> intercepting
> USR2 for something it does.
Ping? Was this it?
"Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com> wrote:> Sorry for the delay. It looks like disabling preload_app did the > trick. A new master was created after sending the USR2. Now the $$$ > question is, what in the world is intercepting the signal? :SGood to know, I''d just grep the installation directories for all your Ruby libs + gems for USR2. I haven''t seen this problem before, but it''d be good to document the conflict, at least.
Eric Wong <normalperson at yhbt.net> wrote:> "Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com> wrote: > > Sorry for the delay. It looks like disabling preload_app did the > > trick. A new master was created after sending the USR2. Now the $$$ > > question is, what in the world is intercepting the signal? :S > > Good to know, I''d just grep the installation directories for all your > Ruby libs + gems for USR2. I haven''t seen this problem before, but it''d > be good to document the conflict, at least.Btw, did you ever figure out what was causing the conflict? Pushing this out to git://bogomips.org/unicorn.git>From 1e13ffee3469997286e65e0563b6433e7744388a Mon Sep 17 00:00:00 2001From: Eric Wong <normalperson at yhbt.net> Date: Tue, 20 Mar 2012 19:51:35 +0000 Subject: [PATCH] KNOWN_ISSUES: document signal conflicts in libs/apps Jeffrey Yeung confirmed this issue on the mailing list. ref: <E8D9E7CCC2621343A0A3BB45E8DEDFA91C682DD23D at CRPMBOXPRD04.polycom.com> --- KNOWN_ISSUES | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/KNOWN_ISSUES b/KNOWN_ISSUES index f323c68..38263e7 100644 --- a/KNOWN_ISSUES +++ b/KNOWN_ISSUES @@ -3,6 +3,11 @@ Occasionally odd {issues}[link:ISSUES.html] arise without a transparent or acceptable solution. Those issues are documented here. +* Some libraries/applications may install signal handlers which conflict + with signal handlers unicorn uses. Leaving "preload_app false" + (the default) will allow unicorn to always override existing signal + handlers. + * Issues with FreeBSD jails can be worked around as documented by Tatsuya Ono: http://mid.gmane.org/CAHBuKRj09FdxAgzsefJWotexw-7JYZGJMtgUp_dhjPz9VbKD6Q at mail.gmail.com
Eric, I have been unable to narrow down the cause of the conflict so far. The list of Ruby gems (and gem versions) on the affected deployment are identical to the ones on another deployment where Unicorn is upgrading just fine (with preloaded app). Grep''ing for USR2 in the gem installations did not reveal anything, unfortunately. Since then, I haven''t been able to spend further time investigating. Not sure where else to look, really, but I''m open to further suggestions. -Jeff -----Original Message----- From: Eric Wong [mailto:normalperson at yhbt.net] Sent: Tuesday, March 20, 2012 12:58 PM To: unicorn list Cc: Yeung, Jeffrey Subject: Re: Unicorn_rails ignores USR2 signal Eric Wong <normalperson at yhbt.net> wrote:> "Yeung, Jeffrey" <Jeffrey.Yeung at polycom.com> wrote: > > Sorry for the delay. It looks like disabling preload_app did the > > trick. A new master was created after sending the USR2. Now the > > $$$ question is, what in the world is intercepting the signal? :S > > Good to know, I''d just grep the installation directories for all your > Ruby libs + gems for USR2. I haven''t seen this problem before, but > it''d be good to document the conflict, at least.Btw, did you ever figure out what was causing the conflict? Pushing this out to git://bogomips.org/unicorn.git>From 1e13ffee3469997286e65e0563b6433e7744388a Mon Sep 17 00:00:00 2001From: Eric Wong <normalperson at yhbt.net> Date: Tue, 20 Mar 2012 19:51:35 +0000 Subject: [PATCH] KNOWN_ISSUES: document signal conflicts in libs/apps Jeffrey Yeung confirmed this issue on the mailing list. ref: <E8D9E7CCC2621343A0A3BB45E8DEDFA91C682DD23D at CRPMBOXPRD04.polycom.com> --- KNOWN_ISSUES | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/KNOWN_ISSUES b/KNOWN_ISSUES index f323c68..38263e7 100644 --- a/KNOWN_ISSUES +++ b/KNOWN_ISSUES @@ -3,6 +3,11 @@ Occasionally odd {issues}[link:ISSUES.html] arise without a transparent or acceptable solution. Those issues are documented here. +* Some libraries/applications may install signal handlers which +conflict + with signal handlers unicorn uses. Leaving "preload_app false" + (the default) will allow unicorn to always override existing signal + handlers. + * Issues with FreeBSD jails can be worked around as documented by Tatsuya Ono: http://mid.gmane.org/CAHBuKRj09FdxAgzsefJWotexw-7JYZGJMtgUp_dhjPz9VbKD6Q at mail.gmail.com
On 3/20/12 4:09 PM, Yeung, Jeffrey wrote:> I have been unable to narrow down the cause of the conflict so far. The list of Ruby gems (and gem versions) on the affected deployment are identical to the ones on another deployment where Unicorn is upgrading just fine (with preloaded app). Grep''ing for USR2 in the gem installations did not reveal anything, unfortunately. Since then, I haven''t been able to spend further time investigating. Not sure where else to look, really, but I''m open to further suggestions.Jeffery, To uncover the culprit, you might try monkey-patching Kernel.trap and Signal.trap so it logs the last few entries in caller when it''s called with USR2. Put something like this really early in you app bootstrap: [Kernel,Signal].each |klass| class << klass alias :orig_trap :trap def trap *args, &block if args.first.to_s =~ /USR2$/i || args.first.to_i == 31 $stderr.puts "Caught someone trapping USR2 caller is:", caller.last(2) end orig_trap *args, &block end end end
I have little to report, unfortunately. I still have not identified what is
trapping USR2 in my application, if something is indeed doing that.
I''ve attempted to do a bit more investigation work in my spare time
this week, and the results are still rather mysterious.
First, I applied your patch to http_server.rb that you suggested in this thread
back on March 9th (quoted below), which re-overrides the signal handlers in
build_app!(). The patch did work. Okay, that seems to reconfirm that the app
is doing something with the USR2 signal handler...
Then I modified the master_siginit() method a bit further to provide more info:
def master_siginit
QUEUE_SIGS.each { |sig|
old_handler = trap(sig) { SIG_QUEUE << sig; awaken_master }
logger.info("old handler for #{sig.to_s}:
#{old_handler.inspect}")
}
trap(:CHLD) { awaken_master }
end
And the output was a bit surprising. Here is the Unicorn stderr log when
starting up the service. master_siginit() is called twice, as expected...
I, [2012-03-30T14:57:51.301839 #15828] INFO -- : listening on addr=0.0.0.0:8080
fd=3
I, [2012-03-30T14:57:51.302341 #15828] INFO -- : old handler for WINCH: nil
I, [2012-03-30T14:57:51.302435 #15828] INFO -- : old handler for QUIT:
"DEFAULT"
I, [2012-03-30T14:57:51.302505 #15828] INFO -- : old handler for INT:
"DEFAULT"
I, [2012-03-30T14:57:51.302573 #15828] INFO -- : old handler for TERM:
"DEFAULT"
I, [2012-03-30T14:57:51.302641 #15828] INFO -- : old handler for USR1:
"DEFAULT"
I, [2012-03-30T14:57:51.302709 #15828] INFO -- : old handler for USR2:
"DEFAULT"
I, [2012-03-30T14:57:51.302786 #15828] INFO -- : old handler for HUP:
"DEFAULT"
I, [2012-03-30T14:57:51.302853 #15828] INFO -- : old handler for TTIN: nil
I, [2012-03-30T14:57:51.302917 #15828] INFO -- : old handler for TTOU: nil
I, [2012-03-30T14:57:51.303502 #15828] INFO -- : Refreshing Gem list
old handler for WINCH:
#<Proc:0x00000002d97488@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745>
old handler for QUIT:
#<Proc:0x00000002d97098@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745>
old handler for INT:
#<Proc:0x00000002d96c70@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745>
old handler for TERM:
#<Proc:0x00000002d96848@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745>
old handler for USR1:
#<Proc:0x00000002d96420@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745>
old handler for USR2:
#<Proc:0x00000002d95ff8@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745>
old handler for HUP:
#<Proc:0x00000002d95bd0@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745>
old handler for TTIN:
#<Proc:0x00000002d957a8@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745>
old handler for TTOU:
#<Proc:0x00000002d953b8@/usr/local/lib/ruby/gems/1.9.1/gems/unicorn-4.2.0/lib/unicorn/http_server.rb:745>
master process ready
worker=0 ready
Which to me, looks like the USR2 handler was unchanged in the second
master_siginit() call. Yet, if I remove that second master_siginit() call from
there, USR2 ceases to work. :S
-Jeff
-----Original Message-----
diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb index
7d2c623..1b9d693 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -128,8 +128,7 @@ class Unicorn::HttpServer
# setup signal handlers before writing pid file in case people get
# trigger happy and send signals as soon as the pid file exists.
# Note that signals don''t actually get handled until the #join
method
- QUEUE_SIGS.each { |sig| trap(sig) { SIG_QUEUE << sig; awaken_master }
}
- trap(:CHLD) { awaken_master }
+ master_siginit
self.pid = config[:pid]
self.master_pid = $$
@@ -689,6 +688,9 @@ class Unicorn::HttpServer
Gem.refresh
end
self.app = app.call
+
+ # override signal handlers the app may have set
+ master_siginit if preload_app
end
end
@@ -736,4 +738,9 @@ class Unicorn::HttpServer
config_listeners.each { |addr| listen(addr) }
raise ArgumentError, "no listeners" if LISTENERS.empty?
end
+
+ def master_siginit
+ QUEUE_SIGS.each { |sig| trap(sig) { SIG_QUEUE << sig; awaken_master }
}
+ trap(:CHLD) { awaken_master }
+ end
end
Hey Jeffrey, we had a similar problem about 8 months ago. What version of ruby (including patch number) are you using? I think you said you''re not using anything special in your deployment environment, right? You''re not using god, by chance to do the restarts, are you? Also, how are you doing your deployments, if not with cap? -- Alex Sharp Zaarly, Inc | @ajsharp | github.com/ajsharp | alexjsharp.com
We''ve run into the exact same problem. We spent a few good solid hours
trying to
fix the problem with no avail (but we do have a ''solution'').
Server specs:
Ubuntu 12.04.2 LTS (GNU/Linux 3.5.0-27-generic x86_64)
Rails & Ruby:
Rails 4.0.0.beta1
Ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-linux]
Unicorn 4.6.2 - preload_app true
Capistrano
The first thing we tried was removing all of the gems (foreman, sidekiq, etc.)
and that didn''t work. We tried preload_app false, and that worked (but
this is
not the situation we want). We also tried searching for any gems that were
capturing USR2 (or any signals in general).
We tried monkey patching Kernel & Trap to see if we could pinpoint the
problem,
but that didn''t seem to turn up much. We then jumped in
unicorn''s http_server.rb
and threw in some logging. The only thing that seemed to show us was that USR2
was unable to get through to unicorn. Signals HUP, QUIT, etc. all worked fine
(and made their way through as per norm).
We then fired up strace with unicorn to see if we could ''see''
the USR2 signal.
It looks almost identical to Jeffery''s strace. It looks like the USR2
signal is
received by the process, but then dropped (or ignored). If we pass it a QUIT
signal it works just fine and we can see that in the strace.
So here''s the chain of events that we think is occuring:
1. We start unicorn (a unicorn blessing)
2. before_fork is triggered (Signal trapping works here)
3. A unicorn is born (Signal trapping does not work here)
4. The after fork is triggered
In our situation. Once a unicorn is born, it can no longer received a USR2 (for
whatever reason). We never actually checked to see if the after fork received a
USR2.
After we managed to figure out the path of execution, we figured out a quick
fix. We actually call reexec method directly on the unicorn server in the
before_fork block in unicorn.rb. Here''s our unicorn.rb
https://gist.github.com/RyanonRails/5541902
----- snip -----
before_fork do |server, worker|
Signal.trap ''USR2'' do
puts ''Since a black hole is lurking and eating USR2s we will hit
http_server#reexec ourselves''
server.send(:reexec)
end
end
----- snip -----
This way we can actually force the USR2 directly on the server (since USR2 is
available inside of the before_fork).
This seems to be working well for us now. We''re still frustrated that
we weren''t
able to find the cause of the problem, but we least we can move forward (with
minor disruption to the code/code base).
Lastly, here are our current theories as to why this could be occuring:
1. When unicorn is building it''s native extensions something weird is
happening
in regards to the USR2 signal
2. A gem is somehow gobbling up the signal (we were quite thorough, but we
might''ve missed something)
Hopefully this helps!
Thanks,
Ryan & Mike
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying