Hey,

So here are some tasks for managing unicorn: https://gist.github.com/1121076

I've found that it's very unreliable for quitting / terminating unicorn
and restarting with new code. When doing rapid deployments particularly,
I've found that I have to go in and kill -9 the master process, and
start again. Any thoughts on why it seems ineffective?

Thanks.

PS: here's the unicorn config: https://gist.github.com/0dd07c5ad00c56d161c7

James
> I've found that it's very unreliable for quitting / terminating
> unicorn and restarting with new code. When doing rapid deployments
> particularly, I've found that I have to go in and kill -9 the master
> process, and start again.

We've noticed some of this behavior as well. We've seen the new master
start spinning and consume 100% CPU (according to top). The old master
and workers are still running and working, but the new master just
hangs, and we have to kill -9.

Our solution was to add the following to our unicorn config, which
*seems* to have solved the problem:

  Unicorn::HttpServer::START_CTX[0] =
    "#{path_to_app}/shared/bundle/ruby/1.9.1/bin/unicorn_rails"

This is outlined a bit here: http://unicorn.bogomips.org/Sandbox.html

I'm not really sure why this seems to have fixed the problem -- all I
can tell you is that we haven't seen it since adding this line to the
config.

- Alex
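For context, a sketch of where that line sits in a unicorn config file;
every path and setting below is hypothetical, not from Alex's setup:

  # config/unicorn.rb -- all paths here are illustrative
  shared = "/var/www/myapp/shared"

  # Pin the binary re-executed on USR2 to the bundled binstub, as
  # described at http://unicorn.bogomips.org/Sandbox.html:
  Unicorn::HttpServer::START_CTX[0] =
    "#{shared}/bundle/ruby/1.9.1/bin/unicorn_rails"

  worker_processes 4
  listen "#{shared}/sockets/unicorn.sock"
  pid "#{shared}/pids/unicorn.pid"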
Your config has:

  sig = (worker.nr + 1) >= server.worker_processes ? :TERM : :TTOU
  Process.kill(sig, File.read(old_pid).to_i)

I don't have such a conditional in mine, and I use a QUIT signal
(although I don't see why TERM isn't working for you):

  Process.kill("QUIT", File.read(old_pid).to_i)

You might also want to try logging a

  puts File.read(old_pid)

to the unicorn.stdout.log to ensure the old_pid file is being correctly
written to (and in the right path).
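That logging suggestion, sketched inside a before_fork hook -- the pid
path is hypothetical:

  before_fork do |server, worker|
    old_pid = "/var/www/app/shared/pids/unicorn.pid.oldbin"  # illustrative path
    # puts goes to unicorn.stdout.log when stdout_path points there,
    # so this shows whether the old pid file exists and what it holds:
    if File.exist?(old_pid)
      puts "old_pid=#{old_pid} contents=#{File.read(old_pid)}"
    else
      puts "old_pid file missing: #{old_pid}"
    end
  end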
James Cox <james at imaj.es> wrote:
> Hey,
>
> So here are some tasks for managing unicorn:
>
> https://gist.github.com/1121076

Can we ignore the :restart task?  It's a bit fragile since it doesn't
wait for the old process to die (SIGTERM means
kill-as-quickly-as-possible, but given a loaded system it can still
take some time).

> I've found that it's very unreliable for quitting / terminating
> unicorn and restarting with new code. When doing rapid deployments
> particularly, I've found that I have to go in and kill -9 the master
> process, and start again.

If you SIGQUIT/SIGTERM before the app is loaded, the signal could be
ignored.  This behavior should probably change...

> any thoughts on why it seems ineffective?
>
> Thanks.

Which version of Unicorn are you using?  I recall some minor tweaks to
avoid/minimize race conditions over the years so maybe some are fixed.

> PS: here's the unicorn config:
>
> https://gist.github.com/0dd07c5ad00c56d161c7

Avoid the top piece of the before_fork hook to TTOU workers, it's
needlessly complex for most deployments unless you're really
memory-constrained.

--
Eric Wong
Alex Sharp <ajsharp at gmail.com> wrote:
> > I've found that it's very unreliable for quitting / terminating
> > unicorn and restarting with new code. When doing rapid deployments
> > particularly, i've found that i have to go in and kill -9 the master
> > process, and start again.
>
> We've noticed some of this behavior as well. We've seen the new master
> start spinning and consume 100% cpu (according to top). The old master
> and workers are still running and working, but the new master just
> hangs, and we have to kill -9.

Can you try to strace (or equivalent) the old master to see what's
using 100% CPU?

--
Eric Wong
On Tue, Aug 2, 2011 at 17:53, Eric Wong <normalperson at yhbt.net> wrote:
> James Cox <james at imaj.es> wrote:
>> Hey,
>>
>> So here are some tasks for managing unicorn:
>>
>> https://gist.github.com/1121076
>
> Can we ignore the :restart task?  It's a bit fragile since it doesn't
> wait for the old process to die (SIGTERM means
> kill-as-quickly-as-possible, but given a loaded system it can still
> take some time).

We mostly restart (what a surprise). What pattern works best here for
that? (Speed of deploy is important, but definitely assume a long boot
time.)

>> I've found that it's very unreliable for quitting / terminating
>> unicorn and restarting with new code. When doing rapid deployments
>> particularly, I've found that I have to go in and kill -9 the master
>> process, and start again.
>
> If you SIGQUIT/SIGTERM before the app is loaded, the signal could
> be ignored.  This behavior should probably change...
>
>> any thoughts on why it seems ineffective?
>>
>> Thanks.
>
> Which version of Unicorn are you using?  I recall some minor tweaks
> to avoid/minimize race conditions over the years so maybe some
> are fixed.

gem 'unicorn' - so whatever seems up to date. My lock says 4.0.1.

>> PS: here's the unicorn config:
>>
>> https://gist.github.com/0dd07c5ad00c56d161c7
>
> Avoid the top piece of the before_fork hook to TTOU workers, it's
> needlessly complex for most deployments unless you're really
> memory-constrained.

So what should that look like? All but that nr-workers stuff? Can I
just remove the before_fork? And what would you say is a super good
unicorn config to start from?

thanks!
james
James Cox <james at imaj.es> wrote:
> On Tue, Aug 2, 2011 at 17:53, Eric Wong <normalperson at yhbt.net> wrote:
> > Can we ignore the :restart task?  It's a bit fragile since it doesn't
> > wait for the old process to die (SIGTERM means
> > kill-as-quickly-as-possible, but given a loaded system it can still
> > take some time).
>
> We mostly restart (what a surprise). What pattern works best here for
> that? (Speed of deploy is important, but definitely assume a long boot
> time.)

Yeah, if you do a full shutdown, I would definitely wait a little for
the old process to shut down (wait for the pid to disappear) first.

However, a full shutdown means you drop connections, so I would use
USR2 followed by QUIT (on the same pid; USR2 will replace the current
pid file).  See http://unicorn.bogomips.org/SIGNALS.html

> So what should that look like? All but that nr-workers stuff? Can I
> just remove the before_fork? And what would you say is a super good
> unicorn config to start from?

Yeah, skip the before_fork and also after_fork.  Those are mainly for
disconnect/reconnect of ActiveRecord and anything else that might need
a network connection.  I try to have as little as possible in my
unicorn config.

--
Eric Wong
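A sketch of that USR2-then-QUIT sequence as a deploy step; the pid path
and polling are illustrative, and the error handling is simplified:

  pid_file = "/var/www/app/shared/pids/unicorn.pid"  # hypothetical path
  old_pid  = Integer(File.read(pid_file))

  Process.kill(:USR2, old_pid)  # old pid file moves to unicorn.pid.oldbin

  # Wait until the new master has written its own pid file:
  sleep 1 until File.exist?(pid_file) &&
                Integer(File.read(pid_file)) != old_pid

  Process.kill(:QUIT, old_pid)  # gracefully stop the old master + workers

  # Optionally wait for the old master to actually exit:
  begin
    loop do
      Process.kill(0, old_pid)  # raises Errno::ESRCH once it is gone
      sleep 1
    end
  rescue Errno::ESRCH
    # old master has exited; deploy is done
  end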
On Tue, Aug 2, 2011 at 4:08 PM, Eric Wong <normalperson at yhbt.net> wrote:
> James Cox <james at imaj.es> wrote:
>> So what should that look like? All but that nr-workers stuff? Can I
>> just remove the before_fork? And what would you say is a super good
>> unicorn config to start from?
>
> Yeah, skip the before_fork and also after_fork.  Those are mainly for
> disconnect/reconnect of ActiveRecord and anything else that might need a
> network connection.

FWIW, we use the before_fork hook to automatically kill the old master
by sending it QUIT:

  before_fork do |server, worker|
    old_pid = '/var/www/api/shared/pids/unicorn.pid.oldbin'
    if File.exists?(old_pid) && server.pid != old_pid
      begin
        Process.kill("QUIT", File.read(old_pid).to_i)
      rescue Errno::ENOENT, Errno::ESRCH
        # another newly forked worker has already killed it
      end
    end
  end

- alex
Eric Wong <normalperson at yhbt.net> wrote:
> If you SIGQUIT/SIGTERM before the app is loaded, the signal could
> be ignored. This behavior should probably change...

I pushed the following to git://bogomips.org/unicorn

From 406b8b0e2ed6e5be34d8ec3cd4b16048233c2856 Mon Sep 17 00:00:00 2001
From: Eric Wong <normalperson at yhbt.net>
Date: Tue, 2 Aug 2011 23:52:14 +0000
Subject: [PATCH] trap death signals in the worker sooner

This helps close a race condition preventing shutdown if
loading the application (preload_app=false) takes a long
time and the user decides to kill workers instead.
---
 lib/unicorn/http_server.rb |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index ad5d6f0..565f132 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -535,12 +535,17 @@ class Unicorn::HttpServer
     handle_error(client, e)
   end

+  EXIT_SIGS = [ :QUIT, :TERM, :INT ]
+  WORKER_QUEUE_SIGS = QUEUE_SIGS - EXIT_SIGS
+
   # gets rid of stuff the worker has no business keeping track of
   # to free some resources and drops all sig handlers.
   # traps for USR1, USR2, and HUP may be set in the after_fork Proc
   # by the user.
   def init_worker_process(worker)
-    QUEUE_SIGS.each { |sig| trap(sig, nil) }
+    # we'll re-trap :QUIT later for graceful shutdown iff we accept clients
+    EXIT_SIGS.each { |sig| trap(sig) { exit!(0) } }
+    WORKER_QUEUE_SIGS.each { |sig| trap(sig, nil) }
     trap(:CHLD, 'DEFAULT')
     SIG_QUEUE.clear
     proc_name "worker[#{worker.nr}]"
@@ -578,7 +583,6 @@ class Unicorn::HttpServer
     # closing anything we IO.select on will raise EBADF
     trap(:USR1) { nr = -65536; SELF_PIPE[0].close rescue nil }
     trap(:QUIT) { worker = nil; LISTENERS.each { |s| s.close rescue nil }.clear }
-    [:TERM, :INT].each { |sig| trap(sig) { exit!(0) } } # instant shutdown
     logger.info "worker=#{worker.nr} ready"

     begin

--
Eric Wong
Eric Wong <normalperson at yhbt.net> wrote:
> Eric Wong <normalperson at yhbt.net> wrote:
> > If you SIGQUIT/SIGTERM before the app is loaded, the signal could
> > be ignored. This behavior should probably change...
>
> I pushed the following to git://bogomips.org/unicorn
>
> From 406b8b0e2ed6e5be34d8ec3cd4b16048233c2856 Mon Sep 17 00:00:00 2001

Also pushed to rubygems.org for easier testing:

  gem install --pre unicorn -v 4.0.1.4.g406b

This will probably be unicorn 4.0.2 shortly.
On Tue, Aug 2, 2011 at 2:54 PM, Eric Wong <normalperson at yhbt.net> wrote:
> Can you try to strace (or equivalent) the old master to see what's
> using 100% CPU?

All I see is a whole lot of this:

sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0

- alex
On Thu, Aug 4, 2011 at 9:09 PM, Alex Sharp <ajsharp at gmail.com> wrote:
> On Tue, Aug 2, 2011 at 2:54 PM, Eric Wong <normalperson at yhbt.net> wrote:
>> Can you try to strace (or equivalent) the old master to see what's
>> using 100% CPU?
>
> All I see is a whole lot of this:
>
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0

Actually, my fault -- the last email was the output of the new master.
Running strace on the old master shows the following:

select(4, [3], NULL, NULL, {13, 466686}) = 0 (Timeout)
wait4(-1, 0x7fff57e7bfcc, WNOHANG, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1312517345, 425411398}) = 0
fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
clock_gettime(CLOCK_REALTIME, {1312517345, 425625251}) = 0
fstat(11, {st_mode=S_IFREG, st_size=0, ...}) = 0
clock_gettime(CLOCK_REALTIME, {1312517345, 425779281}) = 0
fstat(12, {st_mode=S_IFREG, st_size=0, ...}) = 0
clock_gettime(CLOCK_REALTIME, {1312517345, 425927762}) = 0

The first line was when the master was idle, and then I threw a few
requests at it.

- alex
Alex Sharp <ajsharp at gmail.com> wrote:
> On Thu, Aug 4, 2011 at 9:09 PM, Alex Sharp <ajsharp at gmail.com> wrote:
> > On Tue, Aug 2, 2011 at 2:54 PM, Eric Wong <normalperson at yhbt.net> wrote:
> >> Can you try to strace (or equivalent) the old master to see what's
> >> using 100% CPU?
>
> Actually, my fault -- the last email was the output of the new master.
> Running strace on the old master shows the following:

Is this Unicorn 3.x?  Which 3.x version exactly?  Maybe give 4.0.1 or
the 4.0.2 git prerelease a try, too.

> select(4, [3], NULL, NULL, {13, 466686}) = 0 (Timeout)
> wait4(-1, 0x7fff57e7bfcc, WNOHANG, NULL) = 0
> clock_gettime(CLOCK_REALTIME, {1312517345, 425411398}) = 0
> fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
> clock_gettime(CLOCK_REALTIME, {1312517345, 425625251}) = 0
> fstat(11, {st_mode=S_IFREG, st_size=0, ...}) = 0
> clock_gettime(CLOCK_REALTIME, {1312517345, 425779281}) = 0
> fstat(12, {st_mode=S_IFREG, st_size=0, ...}) = 0
> clock_gettime(CLOCK_REALTIME, {1312517345, 425927762}) = 0

Can I get more? (until the next select() call, at least).  I'd like to
see the timeout argument that gets passed to the next select().

If you see select() with a very small timeout, use "strace -v" to get
the st_ctime from the fstat()s...

I could have a math bug in murder_lazy_workers (please read/review the
logic in that method, I haven't noticed issues myself).

I made some tweaks to the master sleep strategy within the past year to
reduce wakeups during idle hours.  This is intended to go with Ruby
1.9.3, which will have a /much/ better thread wakeup strategy that
reduces power consumption during idle times.

> The first line was when the master was idle, and then I threw a few
> requests at it.

Are all workers responding as expected and not dying?

--
Eric Wong
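To make the "math bug" worry concrete, here is the shape of the timeout
bookkeeping murder_lazy_workers performs -- this is not unicorn's actual
code; the method name, last_checkin structure, and :KILL choice are all
illustrative:

  # last_checkin maps worker pids to the epoch seconds of each worker's
  # last heartbeat; timeout is the configured worker timeout.
  def murder_lazy_workers_sketch(last_checkin, timeout)
    next_sleep = timeout
    now = Time.now.to_i
    last_checkin.each_pair do |pid, tick_at|
      wait = now - tick_at
      if wait >= timeout
        Process.kill(:KILL, pid)  # worker looks stuck; graceful :QUIT may be ignored
      else
        # a math bug here could hand the master a tiny select() timeout
        next_sleep = [timeout - wait, next_sleep].min
      end
    end
    next_sleep  # the master's select() sleeps at most this long
  end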
On Fri, Aug 5, 2011 at 1:07 AM, Eric Wong <normalperson at yhbt.net> wrote:
> Is this Unicorn 3.x?  Which 3.x version exactly?  Maybe give 4.0.1
> or the 4.0.2 git prerelease a try, too.

We're using version 3.6.2 with ruby 1.9.2-p180 on ubuntu 11.x. I know
the kernel on this version of ubuntu has a known signal trapping bug,
but I don't think that's what's happening here. The processes respond
after I send the restart signal to them again (USR2 + QUIT).

> Can I get more? (until the next select() call, at least).  I'd like to
> see the timeout argument that gets passed to the next select().

Here's some more output. This is from the old master though, not a new
master that is pegging the CPU. In this instance it's almost like the
old master just ignores the signal.

select(4, [3], NULL, NULL, {8, 129984}) = 0 (Timeout)
wait4(-1, 0x7fff16b82e4c, WNOHANG, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 259408364}) = 0
fstat(12, {st_dev=makedev(202, 1), st_ino=20373, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/12-23:14:20, st_mtime=2011/08/12-23:14:20, st_ctime=2011/08/17-04:22:21}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 259775504}) = 0
fstat(13, {st_dev=makedev(202, 1), st_ino=20381, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/12-23:14:20, st_mtime=2011/08/12-23:14:20, st_ctime=2011/08/17-04:22:17}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 259936816}) = 0
fstat(14, {st_dev=makedev(202, 1), st_ino=20382, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21, st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260086380}) = 0
fstat(15, {st_dev=makedev(202, 1), st_ino=20185, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21, st_ctime=2011/08/17-04:22:17}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260235797}) = 0
fstat(16, {st_dev=makedev(202, 1), st_ino=20255, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21, st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260384849}) = 0
fstat(17, {st_dev=makedev(202, 1), st_ino=20383, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21, st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260534792}) = 0
fstat(18, {st_dev=makedev(202, 1), st_ino=20384, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21, st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260684278}) = 0
fstat(10, {st_dev=makedev(202, 1), st_ino=20196, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/16-16:33:46, st_mtime=2011/08/16-16:33:46, st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260833725}) = 0
select(4, [3], NULL, NULL, {8, 976047}

> If you see select() with very small timeout, use "strace -v" to get
> the st_ctime from fstat()s...
>
> I could have a math bug in murder_lazy_workers (please read/review the
> logic in that method, I haven't noticed issues myself).
>
> I made some tweaks to the master sleep strategy within the past year to
> reduce wakeups during idle hours.  This is intended to go with Ruby
> 1.9.3 which will have /much/ better thread wakeup strategy that reduces
> power consumption during idle times.
>
>> The first line was when the master was idle, and then I threw a few
>> requests at it.
>
> Are all workers responding as expected and not dying?

The old workers appear to be serving requests.

- Alex
Alex Sharp <ajsharp at gmail.com> wrote:
> On Fri, Aug 5, 2011 at 1:07 AM, Eric Wong <normalperson at yhbt.net> wrote:
> > Can I get more? (until the next select() call, at least).  I'd like to
> > see the timeout argument that gets passed to the next select().
>
> Here's some more output. This is from the old master though, not a new
> master that is pegging the CPU. In this instance it's almost like the
> old master just ignores the signal.

Wait, weren't you trying to track down a problem with the master that's
pegging the CPU?  Isn't the CPU usage the problem here?

I was also thinking the CPU usage in the new master was caused by
constant worker respawning because bundler wasn't loaded correctly
somehow....

> select(4, [3], NULL, NULL, {8, 129984}) = 0 (Timeout)
<snip>
> select(4, [3], NULL, NULL, {8, 976047}

OK, so /this/ master is confirmed to be sleeping with a reasonable
timeout and not pegging the CPU...

If you want to actually track down whether or not a signal is being
delivered, use "strace -f -e '!futex'" since Ruby 1.9 has a dedicated
signal handling thread (at the OS level).  (You only need the
-e '!futex' part to filter out the noise from the signal handling
thread in <= 1.9.2; 1.9.3 is much quieter :)

Below is a proposed patch (to unicorn.git) which may help debug issues
in the signal -> handler master path (but only once it enters the Ruby
runtime).  I'm hesitant to commit it since it's worthless if the Ruby
process is stuck because of some bad C extension.  That's the most
common cause of stuck/unresponsive processes I've seen.

Subject: [PATCH] http_server: add debug statements for master sig handlers

This should help users track down what's going on as soon as Ruby can
process the signal.  It's still not going to be useful if the Ruby
process is stuck because of a bug in a C extension or Ruby itself,
though.

Unicorn always defers signals in the master process to serialize signal
handling since some of the important actions (e.g. HUP, USR1, USR2) are
NOT reentrant.
---
 lib/unicorn/http_server.rb |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index aa8212e..b45d6d6 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -126,8 +126,17 @@ class Unicorn::HttpServer
     # setup signal handlers before writing pid file in case people get
     # trigger happy and send signals as soon as the pid file exists.
     # Note that signals don't actually get handled until the #join method
-    QUEUE_SIGS.each { |sig| trap(sig) { SIG_QUEUE << sig; awaken_master } }
-    trap(:CHLD) { awaken_master }
+    QUEUE_SIGS.each do |sig|
+      trap(sig) do
+        @logger.debug("received SIG#{sig}")
+        SIG_QUEUE << sig
+        awaken_master
+      end
+    end
+    trap(:CHLD) do
+      @logger.debug("received SIGCHLD")
+      awaken_master
+    end

     self.pid = config[:pid]
     self.master_pid = $$

--
Eric Wong
Eric Wong <normalperson at yhbt.net> wrote:
> Below is a proposed patch (to unicorn.git) which may help debug issues
> in the signal -> handler master path (but only once it enters the Ruby
> runtime).  I'm hesitant to commit it since it's worthless if the Ruby
> process is stuck because of some bad C extension.  That's the most
> common cause of stuck/unresponsive processes I've seen.

I think that was a bad patch; adding signal handler debugging at the
Ruby layer leads to the false assumption that the interpreter/VM is in
a good state.  If you need to debug signal handlers, something is
already broken and tracing syscalls is the most reliable way to go.

Ruby (and any other high-level language) signal handling is not
straightforward [1].  Here's how things work in Matz Ruby 1.9.x [2]:

  you          C timer thread             Ruby Thread(s)
  -------------------------------------------------------------------
               traps signals              ignores most signals
               sleeps                     runs Ruby...
  kill -USR2   ...
               receives signal (async)
               runs (system) sighandler[1]
               wakes up from sleep
               signals Ruby Thread(s)
                                          *hopefully wakes up*
                                          runs Ruby sighandler

The "*hopefully wakes up*" part is the part most likely to fail as a
result of a bad C extension or Ruby bug.

PS. In Ruby 1.9.3, the timer thread uses the "self-pipe" sighandler
implementation that the unicorn master process always used.  This
allows Ruby 1.9.3 to conserve power on idle processes.  In 1.9.2, the
timer thread signal handler just polls in 10ms intervals to check if
any signals were received.  This is why "strace -f" is noisy and I
recommend "-e '!futex'" for 1.9.2.

PPS. Unicorn still uses the "self-pipe" signal handler in Ruby-land
because Ruby signal handlers are reentrant and so must execute
reentrant-safe code.  Without the self-pipe to serialize signal handler
dispatch, the Ruby signal handler execution can nest and overlap
execution with itself.  This means if USR2 is sent multiple times in
short succession, you could spawn multiple new unicorn masters.

[1] - See "man 7 signal" in Linux manpages or POSIX specs for the small
      list of safe functions that may be called in system-level
      sighandlers.  Ruby-level signal handlers can't run inside
      system-level signal handlers for this reason.

[2] - I think any high-level language that implements signal handlers
      AND native threads must do something similar.  The only valid
      variation I can think of is to execute the high-level language
      code inside the timer thread, but that requires the coders of the
      high-level language to have thread-safety (not just reentrancy)
      in mind when writing signal handlers even if the rest of their
      code uses no threads.

--
Eric Wong
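To illustrate the self-pipe trick mentioned above, here is a minimal
standalone sketch -- not unicorn's actual code; the signal names and
the 10-second timeout are arbitrary:

  SELF_PIPE = IO.pipe   # [reader, writer]
  SIG_QUEUE = []

  %w(QUIT USR1 USR2).each do |sig|
    trap(sig) do
      SIG_QUEUE << sig  # reentrant-safe: only record that the signal arrived
      begin
        SELF_PIPE[1].write_nonblock(".")  # wake the sleeping main loop
      rescue Errno::EAGAIN, Errno::EINTR
      end
    end
  end

  loop do
    # the main loop sleeps here; an incoming signal makes the pipe readable
    if IO.select([SELF_PIPE[0]], nil, nil, 10)
      SELF_PIPE[0].read_nonblock(16) rescue nil  # drain the wakeup bytes
    end
    while sig = SIG_QUEUE.shift
      puts "handling SIG#{sig} serially, outside the trap"  # real work here
    end
  end

Because the queue is drained one entry at a time in the main loop, the
handlers never nest, which is exactly the serialization property the
PPS describes.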
Ok, thanks a lot for the patch attempt and subsequent in-depth
explanation. I know this thread is long-lived and a bit confusing,
mostly because I've seen two different scenarios that yield the same
result, which is that unicorn does not restart.

The first scenario (and the one that started this thread) was that
unicorn actually receives the signal, spawns a new master, and that
process pegs the cpu and spins into eternity. The second scenario is
that the old master seems to completely ignore the USR2 signal the
first time, and when sent again, properly restarts. In both scenarios
the old master and workers continue to serve the old code.

I thought that setting Unicorn::HttpServer::START_CTX[0] in my unicorn
config had solved the first scenario (pegged cpu), but I found a new
occurrence of it today, and I now have some new strace output for this
scenario (strace -v):

select(4, [3], NULL, NULL, {60, 727229}) = 0 (Timeout)
wait4(-1, 0x7ffffd70f72c, WNOHANG, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1313708081, 720757662}) = 0
fstat(8, {st_dev=makedev(202, 1), st_ino=39806, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10, st_ctime=2011/08/18-22:54:06}) = 0
clock_gettime(CLOCK_REALTIME, {1313708081, 721131305}) = 0
fstat(10, {st_dev=makedev(202, 1), st_ino=45370, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10, st_ctime=2011/08/18-22:54:05}) = 0
clock_gettime(CLOCK_REALTIME, {1313708081, 721290800}) = 0
select(4, [3], NULL, NULL, {565, 34005}) = 0 (Timeout)
wait4(-1, 0x7ffffd70f72c, WNOHANG, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1313708646, 853870540}) = 0
fstat(8, {st_dev=makedev(202, 1), st_ino=39806, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10, st_ctime=2011/08/18-22:59:06}) = 0
clock_gettime(CLOCK_REALTIME, {1313708646, 854102750}) = 0
fstat(10, {st_dev=makedev(202, 1), st_ino=45370, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10, st_ctime=2011/08/18-23:04:05}) = 0
clock_gettime(CLOCK_REALTIME, {1313708646, 854260655}) = 0
select(4, [3], NULL, NULL, {598, 630876}

With respect to the second scenario (ignoring signals), I'm going to
recommend we table that issue for now, as we're currently running on a
version of ubuntu (11.10) which has a known signal trapping bug with
ruby 1.9.2-p180, and downgrading to 10.04 may solve that problem.
Granted, when I've observed this in the past with other libraries, the
process becomes completely unresponsive, whereas the behavior in
unicorn is more intermittent. Either way, we are in the process of
downgrading our servers to ubuntu 10.04, so let's not waste anymore
time trying to debug something that may well be occurring due to a
kernel bug. If we continue to see this after the downgrade to 10.04,
I'll report back and we can keep digging.

Again, my apologies for the confusion between the two scenarios, and
thanks for all your help.

- Alex
Alex Sharp <ajsharp at gmail.com> wrote:
> Ok, thanks a lot for the patch attempt and subsequent in-depth
> explanation. I know this thread is long-lived and a bit confusing,
> mostly because I've seen two different scenarios that yield the same
> result, which is that unicorn does not restart.

No problem, I'm here to help and mutt makes it easy to follow long
threads :>

<snip>

> I thought that setting Unicorn::HttpServer::START_CTX[0] in my unicorn
> config had solved the first scenario (pegged cpu), but I found a new
> occurrence of it today, and I now have some new strace output for this
> scenario (strace -v):
>
> select(4, [3], NULL, NULL, {60, 727229}) = 0 (Timeout)
> wait4(-1, 0x7ffffd70f72c, WNOHANG, NULL) = 0
> clock_gettime(CLOCK_REALTIME, {1313708081, 720757662}) = 0
> fstat(8, {st_dev=makedev(202, 1), st_ino=39806, st_mode=S_IFREG,
> st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
> st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10,
> st_ctime=2011/08/18-22:54:06}) = 0
> clock_gettime(CLOCK_REALTIME, {1313708081, 721131305}) = 0
> fstat(10, {st_dev=makedev(202, 1), st_ino=45370, st_mode=S_IFREG,
> st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
> st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10,
> st_ctime=2011/08/18-22:54:05}) = 0
> clock_gettime(CLOCK_REALTIME, {1313708081, 721290800}) = 0
> select(4, [3], NULL, NULL, {565, 34005}) = 0 (Timeout)

<snip>

OK, this looks good from the system call level in the main thread of
the master process: I can see that from select() timing out despite
being called with long intervals.  I don't see workers dying, either...

Can you add "-f -e '!futex'" to the strace invocation so we see all the
threads?  You can also add '-T' to get timing information to see how
long each syscall takes, or '-tt' to get timestamps in strace, if you
think it's useful.

Since the main thread doesn't appear to be doing much, it's probably a
child thread somewhere...

Also, I assume you're using preload_app?  If you are, do you see this
issue with preload_app=false?  I suspect there's some background thread
that could be running in the master process taking up all the CPU time.
Unicorn itself never spawns background threads.

Basically anything you can tell us about the app, the configuration,
and which gems/libraries would be useful.

> With respect to the second scenario (ignoring signals), I'm going to
> recommend we table that issue for now, as we're currently running on a
> version of ubuntu (11.10) which has a known signal trapping bug with
> ruby 1.9.2-p180, and downgrading to 10.04 may solve that problem.
> Granted, when I've observed this in the past with other libraries, the
> process becomes completely unresponsive, whereas the behavior in
> unicorn is more intermittent. Either way, we are in the process of
> downgrading our servers to ubuntu 10.04, so let's not waste anymore
> time trying to debug something that may well be occurring due to a
> kernel bug. If we continue to see this after the downgrade to 10.04,
> I'll report back and we can keep digging.

OK.  I wouldn't rule out the CPU usage as unrelated to the signaling
issue, though.  The Ruby 1.9 timer thread could be going crazy because
it can't signal or receive signals properly...

--
Eric Wong
On Thu, Aug 18, 2011 at 6:53 PM, Eric Wong <normalperson at yhbt.net> wrote:
> I can see that from select() timing out despite being called with
> long intervals.  I don't see workers dying, either...

That's correct, the old workers are staying alive as well.

> Can you add "-f -e '!futex'" to the strace invocation so we see
> all the threads?  You can also add '-T' to get timing information
> to see how long each syscall takes or '-tt' to get timestamps
> in strace if you think it's useful.
>
> Also, I assume you're using preload_app?  If you are, do you see this
> issue with preload_app=false?  I suspect there's some background thread
> that could be running in the master process taking up all the CPU time.
> Unicorn itself never spawns background threads.

Yes, we're using preload_app=true. I haven't tried it with
preload_app=false -- if I get to it tomorrow I'll report back here.

We *are* using newrelic, which operates in a background thread. I've
just found this ticket in the newrelic support forums:
https://support.newrelic.com/help/discussions/support/7692-newrelic_rpm-gem-with-unicorn-40

I'll re-run strace with the suggested flags above and report back.

> Basically anything you can tell us about the app, the configuration,
> and which gems/libraries would be useful.

Gemfile: https://gist.github.com/05a9445471ad7edfdcb7

> OK.  I wouldn't rule out the CPU usage as unrelated to the signaling
> issue, though.  The Ruby 1.9 timer thread could be going crazy
> because it can't signal or receive signals properly...

--
alex sharp
github.com/ajsharp
twitter.com/ajsharp
alexjsharp.com
Alex Sharp <ajsharp at gmail.com> wrote:
> On Thu, Aug 18, 2011 at 6:53 PM, Eric Wong <normalperson at yhbt.net> wrote:
> > Basically anything you can tell us about the app, the configuration,
> > and which gems/libraries would be useful.
>
> Gemfile: https://gist.github.com/05a9445471ad7edfdcb7

OK, out of the few gems I recognize, I don't see anything else besides
newrelic_rpm that might run a background thread.  I would scan through
the rest of them to be sure (and maybe other folks here can give you
pointers).

Personally, I never depend on anything before I've had a chance to scan
through the code and know what it does, first.

--
Eric Wong
Eric Wong <normalperson at yhbt.net> wrote:
> Eric Wong <normalperson at yhbt.net> wrote:
> > If you SIGQUIT/SIGTERM before the app is loaded, the signal could
> > be ignored. This behavior should probably change...
>
> I pushed the following to git://bogomips.org/unicorn
>
> From 406b8b0e2ed6e5be34d8ec3cd4b16048233c2856 Mon Sep 17 00:00:00 2001
> From: Eric Wong <normalperson at yhbt.net>
> Date: Tue, 2 Aug 2011 23:52:14 +0000
> Subject: [PATCH] trap death signals in the worker sooner
>
> This helps close a race condition preventing shutdown if
> loading the application (preload_app=false) takes a long
> time and the user decides to kill workers instead.

The following completely eliminates the race condition:

From 8de6ab371c1623669b86a5dfa8703c8fd539011f Mon Sep 17 00:00:00 2001
From: Eric Wong <normalperson at yhbt.net>
Date: Fri, 19 Aug 2011 22:13:04 +0000
Subject: [PATCH] close race if an exit signal hits the worker before trap

The signal handler from the master is still active and will push the
pending signal to SIG_QUEUE if a worker receives a signal immediately
after forking.
---
 lib/unicorn/http_server.rb |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index aa8212e..4f516c9 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -549,6 +549,7 @@ class Unicorn::HttpServer
   def init_worker_process(worker)
     # we'll re-trap :QUIT later for graceful shutdown iff we accept clients
     EXIT_SIGS.each { |sig| trap(sig) { exit!(0) } }
+    exit!(0) if (SIG_QUEUE & EXIT_SIGS)[0]
     WORKER_QUEUE_SIGS.each { |sig| trap(sig, nil) }
     trap(:CHLD, 'DEFAULT')
     SIG_QUEUE.clear

--
Eric Wong
>> Can you add "-f -e '!futex'" to the strace invocation so we see
>> all the threads?  You can also add '-T' to get timing information
>> to see how long each syscall takes or '-tt' to get timestamps
>> in strace if you think it's useful.

Here's the result of strace with the suggested flags on the old master.
(To clarify, this is still the spinning CPU issue.)

$ sudo strace -v -f -e '!futex' -p 18862
Process 18862 attached with 2 threads - interrupt to quit
[pid 22170] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 18862] select(4, [3], NULL, NULL, {3, 546106}) = 0 (Timeout)
[pid 18862] wait4(-1, 0x7fffbb907d3c, WNOHANG, NULL) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 459472514}) = 0
[pid 18862] fstat(7, {st_dev=makedev(202, 1), st_ino=31695, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:06}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 459813102}) = 0
[pid 18862] fstat(11, {st_dev=makedev(202, 1), st_ino=31696, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:07}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 459999675}) = 0
[pid 18862] fstat(13, {st_dev=makedev(202, 1), st_ino=31697, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:07}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 460168995}) = 0
[pid 18862] fstat(14, {st_dev=makedev(202, 1), st_ino=31698, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:07}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 460343593}) = 0
[pid 18862] select(4, [3], NULL, NULL, {6, 255959}) = 0 (Timeout)
[pid 18862] wait4(-1, 0x7fffbb907d3c, WNOHANG, NULL) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 739352608}) = 0
[pid 18862] fstat(7, {st_dev=makedev(202, 1), st_ino=31695, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:16}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 739570659}) = 0
[pid 18862] fstat(11, {st_dev=makedev(202, 1), st_ino=31696, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:17}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 739768755}) = 0
[pid 18862] fstat(13, {st_dev=makedev(202, 1), st_ino=31697, st_mode=S_IFREG|01, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:17}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 739951254}) = 0
[pid 18862] fstat(14, {st_dev=makedev(202, 1), st_ino=31698, st_mode=S_IFREG|01, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:17}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 740130965}) = 0
[pid 18862] select(4, [3], NULL, NULL, {9, 996190}^C <unfinished ...>

>> Also, I assume you're using preload_app?  If you are, do you see this
>> issue with preload_app=false?  I suspect there's some background thread
>> that could be running in the master process taking up all the CPU time.
>> Unicorn itself never spawns background threads.

I went ahead and ran strace with the same flags on the *new* master,
and saw a bunch of output that looked bundler-related:
https://gist.github.com/138344b5b19ec6ba1a4c

Even more bizarre, eventually the process started successfully :-/ Is
it possible this had something to do with strace detaching?

You can see this in the unicorn.stderr.log file I included in the gist.
Check out these two lines in particular, which occur 25 minutes apart:

I, [2011-08-23T02:15:08.396868 #22169] INFO -- : Refreshing Gem list
I, [2011-08-23T02:40:16.621210 #22925] INFO -- : worker=1 spawned pid=22925

Question: If I turn on debug flags in unicorn, will I get much of this
output in the unicorn.stderr.log?

- Alex
Alex Sharp <ajsharp at gmail.com> wrote:
> >> Can you add "-f -e '!futex'" to the strace invocation so we see
> >> all the threads?  You can also add '-T' to get timing information
> >> to see how long each syscall takes or '-tt' to get timestamps
> >> in strace if you think it's useful.
>
> Here's the result of strace with the suggested flags on the old master.
> (To clarify, this is still the spinning CPU issue.)
>
> $ sudo strace -v -f -e '!futex' -p 18862
> Process 18862 attached with 2 threads - interrupt to quit
> [pid 22170] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
> [pid 18862] select(4, [3], NULL, NULL, {3, 546106}) = 0 (Timeout)

<snip>

22170 is presumably the timer thread.

Did you send any signals to 18862 while you were making this strace?

If you did send a signal to 18862 while stracing, it's worrying it
didn't show up in your strace.  If you didn't send a signal, then try
sending a signal.

> >> Also, I assume you're using preload_app?  If you are, do you see this
> >> issue with preload_app=false?  I suspect there's some background thread
> >> that could be running in the master process taking up all the CPU time.
> >> Unicorn itself never spawns background threads.
>
> I went ahead and ran strace with the same flags on the *new* master,
> and saw a bunch of output that looked bundler-related:
> https://gist.github.com/138344b5b19ec6ba1a4c
>
> Even more bizarre, eventually the process started successfully :-/ Is
> it possible this had something to do with strace detaching?

That looks like pretty normal "require" behavior.  strace would slow
down requires, a lot.

So this was with preload_app=true?  While you're debugging problems,
I suggest keeping preload_app=false and worker_problems=1 to minimize
the variables.

> You can see this in the unicorn.stderr.log file I included in the
> gist. Check out these two lines in particular, which occur 25 minutes
> apart:
>
> I, [2011-08-23T02:15:08.396868 #22169] INFO -- : Refreshing Gem list
> I, [2011-08-23T02:40:16.621210 #22925] INFO -- : worker=1 spawned pid=22925

Wow, it takes 25 minutes to load your application?  strace makes the
application /much/ slower, so I can actually believe it takes that long.

Loading large applications is very slow under Ruby 1.9.2; there are
some pathological load performance issues fixed in 1.9.3.

So you're saying /without/ strace, CPU usage _stays_ at 100% and
_never_ goes down?

> Question: If I turn on debug flags in unicorn, will I get much of this
> output in the unicorn.stderr.log?

You'll get every exception that's thrown even if it's rescued.
Unicorn 3.x itself doesn't throw that many exceptions.

Generally, can you reproduce this behavior on a plain (empty) Rails
application with no extra gems?

--
Eric Wong
On Tuesday, August 23, 2011 at 12:12 AM, Eric Wong wrote:
> Did you send any signals to 18862 while you were making this strace?

Not while, no. I had already sent a USR2 signal to this process (the
old master), and then I attached strace to it. I'll try sending another
USR2 signal next time while strace is attached.

> > I went ahead and ran strace with the same flags on the *new* master,
> > and saw a bunch of output that looked bundler-related:
> > https://gist.github.com/138344b5b19ec6ba1a4c
> >
> > Even more bizarre, eventually the process started successfully :-/ Is
> > it possible this had something to do with strace detaching?
>
> That looks like pretty normal "require" behavior. strace would slow
> down requires, a lot.
>
> So this was with preload_app=true? While you're debugging problems,
> I suggest keeping preload_app=false and worker_problems=1 to minimize
> the variables.

Ok, I'll change those and report back. I'm guessing you meant
worker_processes (not problems)?

> > You can see this in the unicorn.stderr.log file I included in the
> > gist. Check out these two lines in particular, which occur 25 minutes
> > apart:
> >
> > I, [2011-08-23T02:15:08.396868 #22169] INFO -- : Refreshing Gem list
> > I, [2011-08-23T02:40:16.621210 #22925] INFO -- : worker=1 spawned pid=22925
>
> Wow, it takes 25 minutes to load your application? strace makes the
> application /much/ slower, so I can actually believe it takes that long.

No, my mistake. Loading the application only takes about 10 seconds,
and I only had strace attached to this process for a few seconds (less
than 10). My point here was to show that the new master just spun for a
good 25 minutes (presumably trying to load files over and over again),
and then, seemingly out of nowhere, the new master came up and spawned
the new workers. Next time I'll try to get attached with strace earlier
and record more output.

> Loading large applications is very slow under Ruby 1.9.2; there are
> some pathological load performance issues fixed in 1.9.3.

Yep, I've read about those, and I've seen Xavier's patch, but I don't
think that's the issue here (though, it appears that's why the files
attempting to be loaded in the strace output do not exist). Under
normal circumstances, loading the app takes about 10 seconds and
doesn't peg the cpu while doing it.

> So you're saying /without/ strace, CPU usage _stays_ at 100% and _never_
> goes down?

Correct.

> Generally, can you reproduce this behavior on a plain (empty) Rails
> application with no extra gems?

Good idea, I'll try that next.

- Alex
Alex Sharp <ajsharp at gmail.com> wrote:
> On Tuesday, August 23, 2011 at 12:12 AM, Eric Wong wrote:
> > So this was with preload_app=true?  While you're debugging problems,
> > I suggest keeping preload_app=false and worker_problems=1 to minimize
> > the variables.
>
> Ok, I'll change those and report back. I'm guessing you meant
> worker_processes (not problems)?

LOL yes, it was late last night :)

> > Generally, can you reproduce this behavior on a plain (empty) Rails
> > application with no extra gems?
>
> Good idea, I'll try that next.

I rarely run into problems, but doing things like this (starting with
an empty app and Unicorn as default-as-possible: worker_processes=1,
preload_app=false) should be standard procedure for troubleshooting.
These things are so second nature to me that I forget to mention them :<

I'm also the type who reads the code of /every/ library I introduce
into the app, so I know (or am very quickly capable of knowing) exactly
what an app is doing.

--
Eric Wong
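A minimal troubleshooting config along those lines might look like the
following; the listen port and log paths are illustrative:

  # config/unicorn-debug.rb -- everything else left at unicorn's defaults
  worker_processes 1
  preload_app false
  listen 8080
  pid "/tmp/unicorn-debug.pid"
  stderr_path "/tmp/unicorn-debug.stderr.log"
  stdout_path "/tmp/unicorn-debug.stdout.log"

With only one worker and no preloading, any remaining misbehavior is
much easier to attribute to the app or a gem rather than to fork/reload
interactions.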
Eric Wong <normalperson at yhbt.net> wrote:
> +      trap(sig) do
> +        @logger.debug("received SIG#{sig}")

I'm even more glad I didn't apply this patch to Unicorn.  I completely
forgot Logger uses a mutex internally (even though it doesn't need to
when writing to a POSIX-compliant file system).

Rainbows! has a similar issue I fixed/worked around:
http://mid.gmane.org/20110830233232.GA19633 at dcvr.yhbt.net

--
Eric Wong
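A small standalone illustration of the hazard, assuming stock Logger
behavior (the signal choice here is arbitrary):

  require "logger"

  log = Logger.new($stdout)

  # Logger's log device synchronizes writes on an internal mutex.  If
  # this trap fires while the main thread is inside log.info (already
  # holding that mutex), the handler tries to re-lock a mutex its own
  # thread holds: Ruby raises ThreadError, and Ruby >= 2.0 forbids
  # mutex use inside trap context outright.
  trap(:USR1) { log.debug("received SIGUSR1") }

  loop do
    log.info("working")  # send SIGUSR1 repeatedly to trigger the hazard
    sleep 0.01
  end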
On Tuesday, August 30, 2011 at 5:33 PM, Eric Wong wrote:
> Eric Wong <normalperson at yhbt.net> wrote:
> > +      trap(sig) do
> > +        @logger.debug("received SIG#{sig}")
>
> I'm even more glad I didn't apply this patch to Unicorn. I completely
> forgot Logger uses a mutex internally (even though it doesn't need to
> when writing to a POSIX-compliant file system).
>
> Rainbows! has a similar issue I fixed/worked around:
> http://mid.gmane.org/20110830233232.GA19633 at dcvr.yhbt.net

So are you saying that Unicorn is affected by the same issue? I'm
confused -- does the master send USR1 signals to the workers when it
receives a USR2? I'm not sending a USR1 signal to the master.

- Alex
Alex Sharp <ajsharp at gmail.com> wrote:
> On Tuesday, August 30, 2011 at 5:33 PM, Eric Wong wrote:
> > Eric Wong <normalperson at yhbt.net> wrote:
> > > +      trap(sig) do
> > > +        @logger.debug("received SIG#{sig}")
> >
> > I'm even more glad I didn't apply this patch to Unicorn. I completely
> > forgot Logger uses a mutex internally (even though it doesn't need to
> > when writing to a POSIX-compliant file system).
> >
> > Rainbows! has a similar issue I fixed/worked around:
> > http://mid.gmane.org/20110830233232.GA19633 at dcvr.yhbt.net
>
> So are you saying that Unicorn is affected by the same issue? I'm
> confused -- does the master send USR1 signals to the workers when it
> receives a USR2? I'm not sending a USR1 signal to the master.

No, Unicorn is not affected.  It /would have/ been, had I applied that
patch.

--
Eric Wong
Also, today I did an strace on an old unicorn master that was not
responding to USR2 signals (but not pegging the cpu), and I saw tons of
gettimeofday calls in the output. Is this normal? Also, I was seeing
this after having removed newrelic from the project.

--
Alex Sharp
github.com/ajsharp
twitter.com/ajsharp
alexjsharp.com
Alex Sharp <ajsharp at gmail.com> wrote:
> Also, today I did an strace on an old unicorn master that was not
> responding to USR2 signals (but not pegging the cpu), and I saw tons
> of gettimeofday calls in the output. Is this normal? Also, I was
> seeing this after having removed newrelic from the project.

Yes, if it's from the same (timer) thread that's making a bunch of
futex() calls (which you've probably filtered out) on 1.9.2.  They'll
go away on Ruby 1.9.3.

On x86_64 and a modern-ish kernel + glibc, gettimeofday() isn't even a
syscall (and won't show up in strace):
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-gettimeofday_speedup.html