Since upgrading bundler (though with the RUBYOPT path fixes applied) we've
started experiencing intermittent, difficult-to-isolate USR2 restart
failures.

After a USR2 signal our process tree winds up looking like this, with
several master-esque processes listed as children (but without the
"worker[N]" label):

app 14402  4.4  0.8 199612 70264 ?  S   14:07  0:04 unicorn_rails master -c config/unicorn.rb -E production -D
app 14433  0.0  0.8 204540 68504 ?  Sl  14:07  0:00  \_ unicorn_rails worker[0] -c config/unicorn.rb -E production -D
app 14435  0.0  0.8 204540 68508 ?  Sl  14:07  0:00  \_ unicorn_rails worker[1] -c config/unicorn.rb -E production -D
app 14438  0.0  0.8 199748 65840 ?  S   14:07  0:00  \_ /usr/bin/ruby1.8 /usr/bin/unicorn_rails -c config/unicorn.rb -E production -D
app 14440  0.0  0.8 204540 68508 ?  Sl  14:07  0:00  \_ unicorn_rails worker[3] -c config/unicorn.rb -E production -D
app 14442  0.0  0.8 204540 68508 ?  Sl  14:07  0:00  \_ unicorn_rails worker[4] -c config/unicorn.rb -E production -D
app 14445  0.0  0.8 199760 65840 ?  S   14:07  0:00  \_ /usr/bin/ruby1.8 /usr/bin/unicorn_rails -c config/unicorn.rb -E production -D
app 14447  0.0  0.8 204540 68508 ?  Sl  14:07  0:00  \_ unicorn_rails worker[6] -c config/unicorn.rb -E production -D
app 14449  0.0  0.8 204780 69272 ?  Sl  14:07  0:00  \_ unicorn_rails worker[7] -c config/unicorn.rb -E production -D

Sending another USR2 signal brings a new master into the mix as a child,
which spins up a single worker of its own (also resembling the
"/usr/bin/ruby1.8" master-esque processes) and then fails to continue.
Further USR2 restarts obviously do nothing, and we're forced to either
kill -9 the stillborn master or cold-restart all of the unicorns.
Nothing out of the ordinary is dumped to stderr or stdout.

Starting unicorns fresh produces a nice process list where every child
is listed cleanly as "unicorn_rails worker[N]".

We only have this issue in one of our applications, on a machine that
has 1 Rails app & 2 Sinatra apps, all powered by nginx+unicorn.  We've
also only noticed this since upgrading to bundler from bundler08.

Are the goofy worker processes in the process tree a real problem, or
just a red herring?

-jamie
Jamie Wilkinson <jamie at tramchase.com> wrote:
> Since upgrading bundler (though with the RUBYOPT path fixes applied) we've
> started experiencing intermittent, difficult-to-isolate USR2 restart
> failures.
>
> After a USR2 signal our process tree winds up looking like this, with
> several master-esque processes listed as children (but without the
> "worker[N]" label):
>
> app 14402  4.4  0.8 199612 70264 ?  S   14:07  0:04 unicorn_rails master -c config/unicorn.rb -E production -D
> app 14433  0.0  0.8 204540 68504 ?  Sl  14:07  0:00  \_ unicorn_rails worker[0] -c config/unicorn.rb -E production -D
> app 14435  0.0  0.8 204540 68508 ?  Sl  14:07  0:00  \_ unicorn_rails worker[1] -c config/unicorn.rb -E production -D
> app 14438  0.0  0.8 199748 65840 ?  S   14:07  0:00  \_ /usr/bin/ruby1.8 /usr/bin/unicorn_rails -c config/unicorn.rb -E production -D
> app 14440  0.0  0.8 204540 68508 ?  Sl  14:07  0:00  \_ unicorn_rails worker[3] -c config/unicorn.rb -E production -D
> app 14442  0.0  0.8 204540 68508 ?  Sl  14:07  0:00  \_ unicorn_rails worker[4] -c config/unicorn.rb -E production -D
> app 14445  0.0  0.8 199760 65840 ?  S   14:07  0:00  \_ /usr/bin/ruby1.8 /usr/bin/unicorn_rails -c config/unicorn.rb -E production -D
> app 14447  0.0  0.8 204540 68508 ?  Sl  14:07  0:00  \_ unicorn_rails worker[6] -c config/unicorn.rb -E production -D
> app 14449  0.0  0.8 204780 69272 ?  Sl  14:07  0:00  \_ unicorn_rails worker[7] -c config/unicorn.rb -E production -D
>
> Sending another USR2 signal brings a new master into the mix as a child,
> which spins up a single worker of its own (also resembling the
> "/usr/bin/ruby1.8" master-esque processes) and then fails to continue.

Hi Jamie,

Odd, if I had to guess, PIDs 14438 and 14445 are actually worker[2] and
worker[5] based on the PIDs relative to the other workers.  So the new
master died right away, which really should've been logged...

> Further USR2 restarts obviously do nothing, and we're forced to either
> kill -9 the stillborn master or cold-restart all of the unicorns.
> Nothing out of the ordinary is dumped to stderr or stdout.

Anything in your before_fork/after_fork hooks?  Since it looks like
you're on a Linux system, can you strace the master while you send it a
USR2 and see if anything strange happens?  Also, can you strace the
weird looking processes above and see if they're doing anything?

> Starting unicorns fresh produces a nice process list where every child
> is listed cleanly as "unicorn_rails worker[N]".
>
> We only have this issue in one of our applications, on a machine that
> has 1 Rails app & 2 Sinatra apps, all powered by nginx+unicorn.  We've
> also only noticed this since upgrading to bundler from bundler08.

I assume you're using regular "unicorn" to run your Sinatra apps and not
"unicorn_rails".  I made some largish cleanups to both for the 0.97.0
release and perhaps some bugs slipped into the "_rails" variant.

Can you try regular "unicorn" with a config.ru for Rails?  I've stolen
this from the Rainbows! FAQ (http://rainbows.rubyforge.org/FAQ.html):

For Rails 2.3.x and later, the following config.ru will work for you:

  ENV["RAILS_ENV"] ||= ENV["RACK_ENV"]
  require "config/environment"
  use Rails::Rack::Static
  run ActionController::Dispatcher.new

For older versions of Rails, the following config.ru will work:

  ENV["RAILS_ENV"] ||= ENV["RACK_ENV"]
  require 'config/boot'
  require 'config/environment'
  require 'unicorn/app/old_rails'
  require 'unicorn/app/old_rails/static' # not needed with Unicorn 0.95+
  use Unicorn::App::OldRails::Static
  run Unicorn::App::OldRails.new

One thing to watch out for is that RAILS_ENV will not be set in the
environment for you, thus we set it to match RACK_ENV.

> Are the goofy worker processes in the process tree a real problem, or
> just a red herring?

Not sure if it's a problem, but with Bundler I assume Rack itself is a
bundled dependency, yet you're starting unicorn_rails out of
/usr/bin/unicorn_rails, which indicates Unicorn is not bundled (and won't
use the bundled Rack).  Can you ensure your unbundled Rack is the same
version as the bundled one, to be on the safe side?

I've yet to try bundler 0.9 (and have barely used earlier versions), but
you can also try bundling Unicorn and using the bundled bin/unicorn(_rails)
launchers instead to ensure a consistent environment.

Unicorn currently ends up (auto)loading "rack/utils" before the
application is loaded; maybe it could (auto)load it after the app is
loaded for preload_app users.

--
Eric Wong
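[To make the bundling suggestion above concrete: with bundler 0.9 it
would mean declaring unicorn in the app's Gemfile and launching through
the bundled executable rather than /usr/bin/unicorn_rails.  The sketch
below is illustrative only; the gem versions and source line are
placeholders, not taken from Jamie's setup.]

  # Gemfile (bundler 0.9 era) -- illustrative sketch only
  source "http://rubygems.org"
  gem "rails", "2.3.5"   # placeholder version
  gem "rack"             # keep in sync with the system rack version
  gem "unicorn"          # bundle the server so it sees the same Rack as the app

  # then launch with the bundled copy, e.g.:
  #   bundle exec unicorn_rails -c config/unicorn.rb -E production -D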
Eric Wong <normalperson at yhbt.net> wrote:
> I assume you're using regular "unicorn" to run your Sinatra apps and not
> "unicorn_rails".  I made some largish cleanups to both for the 0.97.0
> release and perhaps some bugs slipped into the "_rails" variant.

<snip>

> Jamie Wilkinson <jamie at tramchase.com> wrote:
> > Are the goofy worker processes in the process tree a real problem, or
> > just a red herring?
>
> Not sure if it's a problem, but with Bundler I assume Rack itself is a
> bundled dependency, yet you're starting unicorn_rails out of
> /usr/bin/unicorn_rails, which indicates Unicorn is not bundled (and won't
> use the bundled Rack).  Can you ensure your unbundled Rack is the same
> version as the bundled one, to be on the safe side?
>
> I've yet to try bundler 0.9 (and have barely used earlier versions), but
> you can also try bundling Unicorn and using the bundled bin/unicorn(_rails)
> launchers instead to ensure a consistent environment.
>
> Unicorn currently ends up (auto)loading "rack/utils" before the
> application is loaded; maybe it could (auto)load it after the app is
> loaded for preload_app users.

Do the following two patches help?  I've also pushed out a few cleanups
to unicorn.git and also put up a prerelease gem at:

  http://unicorn.bogomips.org/files/gems/unicorn-0.97.0.7.g22e3.gem

Shortlog of changes since 0.97.0:

Eric Wong (7):
      tests: fix to run under Ruby 1.9.2dev
      KNOWN_ISSUES: document Array#shuffle / Array#sample issue under 1.9
      unicorn_rails: use TOPLEVEL_BINDING for eval
      unicorn_rails: respect user's encoding in config.ru in 1.9
      unicorn_rails: rename variable for consistency
      bin/*: remove unnecessary listeners variable
      unicorn: load constants after app has loaded

From e1a72d58add4260feb6da56d9d588270173da74f Mon Sep 17 00:00:00 2001
From: Eric Wong <normalperson at yhbt.net>
Date: Thu, 8 Apr 2010 17:10:46 -0700
Subject: [PATCH] unicorn_rails: use TOPLEVEL_BINDING for eval

This is to ensure there are no namespace inconsistencies
---
 bin/unicorn_rails |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/bin/unicorn_rails b/bin/unicorn_rails
index 37ee027..de2361e 100755
--- a/bin/unicorn_rails
+++ b/bin/unicorn_rails
@@ -148,7 +148,7 @@ def rails_builder(config, daemonize)
   when /\.ru$/
     raw = File.open(config, "rb") { |fp| fp.sysread(fp.stat.size) }
     raw.sub!(/^__END__\n.*/, '')
-    eval("Rack::Builder.new {(#{raw}\n)}.to_app", nil, config)
+    eval("Rack::Builder.new {(#{raw}\n)}.to_app", TOPLEVEL_BINDING, config)
   else
     require config
     Object.const_get(File.basename(config, '.rb').capitalize)
--

From 22e3ed4de0e89b97dac91c95c796eb8a7f93e5de Mon Sep 17 00:00:00 2001
From: Eric Wong <normalperson at yhbt.net>
Date: Thu, 8 Apr 2010 18:05:43 -0700
Subject: [PATCH] unicorn: load constants after app has loaded

This will help ensure we use the same version of Rack the application
uses and avoid loading conflicting/incompatible versions.
---
 lib/unicorn.rb |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/unicorn.rb b/lib/unicorn.rb
index b63abeb..75ce09d 100644
--- a/lib/unicorn.rb
+++ b/lib/unicorn.rb
@@ -800,15 +800,15 @@ module Unicorn

   def build_app!
     if app.respond_to?(:arity) && app.arity == 0
-      # exploit COW in case of preload_app.  Also avoids race
-      # conditions in Rainbows! since load/require are not thread-safe
-      Unicorn.constants.each { |x| Unicorn.const_get(x) }
-
       if defined?(Gem) && Gem.respond_to?(:refresh)
         logger.info "Refreshing Gem list"
         Gem.refresh
       end
       self.app = app.call
+
+      # exploit COW in case of preload_app.  Also avoids race
+      # conditions in Rainbows! since load/require are not thread-safe
+      Unicorn.constants.each { |x| Unicorn.const_get(x) }
     end
   end

--
Eric Wong
Eric Wong <normalperson at yhbt.net> wrote:
> Do the following two patches help?  I've also pushed out a few cleanups
> to unicorn.git and also put up a prerelease gem at:

Hi Jamie, did you get a chance to try this?

--
Eric Wong
Whew!  After a long day of debugging, I think I've gotten closer to
isolating this.  I've upgraded us to your pre-release gem but I can
still make it happen (both before & after a lot of the fixes below).

>> After a USR2 signal our process tree winds up looking like this, with
>> several master-esque processes listed as children (but without the
>> "worker[N]" label):
>>
>> app 14402  4.4  0.8 199612 70264 ?  S   14:07  0:04 unicorn_rails master -c config/unicorn.rb -E production -D
>> app 14433  0.0  0.8 204540 68504 ?  Sl  14:07  0:00  \_ unicorn_rails worker[0] -c config/unicorn.rb -E production -D
>> app 14435  0.0  0.8 204540 68508 ?  Sl  14:07  0:00  \_ unicorn_rails worker[1] -c config/unicorn.rb -E production -D
>> app 14438  0.0  0.8 199748 65840 ?  S   14:07  0:00  \_ /usr/bin/ruby1.8 /usr/bin/unicorn_rails -c config/unicorn.rb -E production -D
>> app 14440  0.0  0.8 204540 68508 ?  Sl  14:07  0:00  \_ unicorn_rails worker[3] -c config/unicorn.rb -E production -D
>> app 14442  0.0  0.8 204540 68508 ?  Sl  14:07  0:00  \_ unicorn_rails worker[4] -c config/unicorn.rb -E production -D
>> app 14445  0.0  0.8 199760 65840 ?  S   14:07  0:00  \_ /usr/bin/ruby1.8 /usr/bin/unicorn_rails -c config/unicorn.rb -E production -D
>> app 14447  0.0  0.8 204540 68508 ?  Sl  14:07  0:00  \_ unicorn_rails worker[6] -c config/unicorn.rb -E production -D
>> app 14449  0.0  0.8 204780 69272 ?  Sl  14:07  0:00  \_ unicorn_rails worker[7] -c config/unicorn.rb -E production -D
>>
>> Sending another USR2 signal brings a new master into the mix as a child,
>> which spins up a single worker of its own (also resembling the
>> "/usr/bin/ruby1.8" master-esque processes) and then fails to continue.
>
> Anything in your before_fork/after_fork hooks?  Since it looks like
> you're on a Linux system, can you strace the master while you send it a
> USR2 and see if anything strange happens?

The only real contents of our before_fork hook is a
send-QUIT-on-first-worker, which I swapped out for the default SIGTTOU
behavior.  No change.

> I assume you're using regular "unicorn" to run your Sinatra apps and not
> "unicorn_rails".  I made some largish cleanups to both for the 0.97.0
> release and perhaps some bugs slipped into the "_rails" variant.
>
> Not sure if it's a problem, but with Bundler I assume Rack itself is a
> bundled dependency, yet you're starting unicorn_rails out of
> /usr/bin/unicorn_rails, which indicates Unicorn is not bundled (and won't
> use the bundled Rack).  Can you ensure your unbundled Rack is the same
> version as the bundled one, to be on the safe side?

My system & bundled rack versions match.

Swapped to vanilla "unicorn" instead of "unicorn_rails" -- also no dice.

I switched to using "bundle exec unicorn", which uses
RAILS_ROOT/vendor/bundler_gems/bin/unicorn instead of /usr/bin/unicorn.
Was convinced this would be it, but no dice.

Attaching some relevant straces... the new "orphan" master & its 1st
child are pretty boring while just hanging out:

  Process 20738 attached - interrupt to quit
  futex(0x2aaaaafb23c0, FUTEX_WAIT, 2, NULL

Sending a USR2 to the new, orphaned master...

  Process 20738 attached - interrupt to quit
  select(10, [9], [], [], {23, 661000})  = ? ERESTARTNOHAND (To be restarted)
  --- SIGUSR2 (User defined signal 2) @ 0 (0) ---
  rt_sigreturn(0xc)                      = -1 EINTR (Interrupted system call)
  rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
  rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
  rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
  rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
  rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
  clock_gettime(CLOCK_MONOTONIC, {2782804, 359708496}) = 0
  select(0, [], [], [], {0, 0})          = 0 (Timeout)
  rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
  rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
  fcntl(5, F_GETFL)                      = 0x801 (flags O_WRONLY|O_NONBLOCK)
  write(5, "."..., 1)                    = 1
  clock_gettime(CLOCK_MONOTONIC, {2782804, 360046496}) = 0
  select(10, [9], [], [], {20, 464341}

I've also produced straces of the *original* master during USR2
restarts, both a success trace and a failure trace.  Here's a tarball
with both complete traces as well as filtered/grepp'd ones:
http://jamiedubs.com/files/unicorn-strace.tgz

I've also found that kill -9'ing the 1st worker of the new orphaned
master allows it to continue operation as normal (spinning up workers
and taking control from the original master) -- suggesting something is
up with just that first worker (!).  I'm going to keep noodling with
before_/after_fork strategies.

-jamie
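[The send-QUIT-on-first-worker before_fork hook Jamie mentions above is
not shown in the thread; the usual shape of that hook, borrowed from the
unicorn example configs, is sketched below.  The worker.nr check and the
rescue clauses are assumptions about his config, not quotes from it.]

  # config/unicorn.rb -- sketch of the common "kill old master" hook;
  # the .oldbin pid-file suffix is unicorn's convention, the rest is assumed.
  before_fork do |server, worker|
    old_pid = "#{server.config[:pid]}.oldbin"
    if worker.nr == 0 && File.exists?(old_pid) && server.pid != old_pid
      begin
        # ask the old master to quit as soon as the new master's first
        # worker is about to be forked
        Process.kill(:QUIT, File.read(old_pid).to_i)
      rescue Errno::ENOENT, Errno::ESRCH
        # old master already gone
      end
    end
  end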
Jamie Wilkinson <jamie at tramchase.com> wrote:
> I've also produced straces of the *original* master during USR2
> restarts, both a success trace and a failure trace.  Here's a tarball
> with both complete traces as well as filtered/grepp'd ones:
> http://jamiedubs.com/files/unicorn-strace.tgz
>
> I've also found that kill -9'ing the 1st worker of the new orphaned
> master allows it to continue operation as normal (spinning up workers
> and taking control from the original master) -- suggesting something is
> up with just that first worker (!).  I'm going to keep noodling with
> before_/after_fork strategies.

Hi Jamie, your successful straces have this oddity in them:

  poll([{fd=7, events=POLLIN|POLLPRI}], 1, 0) = 1 ([{fd=7, revents=POLLIN|POLLHUP}])
  read(7, ""..., 8192)                    = 0
  shutdown(7, 2 /* send and receive */)   = -1 ENOTCONN (Transport endpoint is not connected)
  close(7)                                = 0

> Eric Wong <normalperson at yhbt.net> wrote:
> > Anything in your before_fork/after_fork hooks?  Since it looks like
> > you're on a Linux system, can you strace the master while you send it a
> > USR2 and see if anything strange happens?
>
> The only real contents of our before_fork hook is a
> send-QUIT-on-first-worker, which I swapped out for the default SIGTTOU
> behavior.  No change.

It looks like you should be disconnecting/reconnecting whatever is using
fd=7 in the before_fork/after_fork hooks respectively.

Unicorn (and mainline Ruby) do not use the poll() system call (which is
why that strace raised red flags), so there's another C extension calling
poll() in your master process (not good).  I remember one (or all) of the
Ruby Postgres libraries using poll() internally, and there are likely
other things that use poll() in Ruby as well.

Perhaps something like this in the config (assuming you're using Rails
ActiveRecord) (also see http://unicorn.bogomips.org/examples/):

  before_fork do |server, worker|
    defined?(ActiveRecord::Base) and
      ActiveRecord::Base.connection.disconnect!
  end

  after_fork do |server, worker|
    defined?(ActiveRecord::Base) and
      ActiveRecord::Base.establish_connection
  end

--
Eric Wong
Hi Jamie, any more news on this issue?  I plan on doing a minor release
later today or tomorrow with a few bug fixes in unicorn.git.

--
Eric Wong
Update from the trenches: I've traced this down to the newrelic_rpm agent.

Noticed this is not the 1st time this has been brought up, since
newrelic spins up the stats collector in its own thread.

Attempted the (old) logger mutex monkeypatch mentioned in the unicorn
docs without luck.  Noodled with various permutations of
NewRelic::Agent.shutdown in before/after_fork without success.  NewRelic
apparently has some compat issues with bundler, but that didn't affect
it, nor did switching to the plugin.

I'm running the latest newrelic_rpm agent (2.11.2) and the latest
unicorn (0.97.1).

I imagine this is contention over its logfile.  Is there any low-hanging
fruit I should try?

I've also filed a bug with NewRelic:
http://support.newrelic.com/discussions/support/2577-newrelic-agentbundler-causing-stillborn-unicorn-processes?unresolve=true

-jamie

On Apr 19, 2010, at 11:21 AM, Eric Wong wrote:
> Hi Jamie, any more news on this issue?  I plan on doing a minor release
> later today or tomorrow with a few bug fixes in unicorn.git.
>
> --
> Eric Wong
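[The before/after_fork permutations Jamie describes trying typically
look something like the sketch below.  This is an illustration only:
NewRelic::Agent.shutdown is the call named in the thread, while
NewRelic::Agent.manual_start is an assumption about the newrelic_rpm API
of that era, and this pattern did not fix the problem for him.]

  # config/unicorn.rb -- sketch of one attempted permutation, not the
  # actual config from this thread
  before_fork do |server, worker|
    # stop the agent's background thread in the master before forking
    NewRelic::Agent.shutdown if defined?(NewRelic::Agent)
  end

  after_fork do |server, worker|
    # restart the agent inside each worker process (assumed API)
    NewRelic::Agent.manual_start if defined?(NewRelic::Agent)
  end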
Jamie Wilkinson <jamie at tramchase.com> wrote:
> Update from the trenches: I've traced this down to the newrelic_rpm
> agent.
>
> Noticed this is not the 1st time this has been brought up, since
> newrelic spins up the stats collector in its own thread.
>
> Attempted the (old) logger mutex monkeypatch mentioned in the unicorn
> docs without luck.  Noodled with various permutations of
> NewRelic::Agent.shutdown in before/after_fork without success.  NewRelic
> apparently has some compat issues with bundler, but that didn't affect
> it, nor did switching to the plugin.
>
> I'm running the latest newrelic_rpm agent (2.11.2) and the latest
> unicorn (0.97.1).
>
> I imagine this is contention over its logfile.  Is there any low-hanging
> fruit I should try?

Hi Jamie, thanks for the follow up.

Exactly which version of Ruby + patchlevel are you using?  Some of the
1.8.7 releases had threading bugs in them which you may be hitting:

  http://redmine.ruby-lang.org/issues/show/1993
  http://www.daniel-azuma.com/blog/view/z2ysbx0e4c3it9/ruby_1_8_7_io_select_threading_bug

But ...

> I've also filed a bug with NewRelic:
> http://support.newrelic.com/discussions/support/2577-newrelic-agentbundler-causing-stillborn-unicorn-processes?unresolve=true

> # straces show there's a bad file descriptor read -- presumably logfiles.
> # I've noodled with shutting down the agent in unicorn before/after forks
> # without a lot of luck.  Tried newrelic as a plugin with the same issue,
> # as well as some of the bundler fixes mentioned in the FAQ, as well as
> # the

Now that we know it's NewRelic, I suspect it could be reading from the
agent's client socket, not a log file.  You can map fd => files/sockets
with "ls -l /proc/$pid/fd" or "lsof -p $pid".

Perhaps in the before_fork hook you can try closing the TCPSocket (or
similar) that NewRelic is using.  Merely stopping the agent thread isn't
guaranteed to close the client socket properly (in fact, I can almost
guarantee it won't close the socket at the OS level).

Since you're on Linux, try putting the output of "ls -l /proc/#$$/fd" or
"lsof -p #$$" in both the after_fork + before_fork hooks to get an idea
of which descriptors are open across forks.

> # Unicorn doesn't play very nicely with threads, are there any other
> # manual setup/teardown methods beyond NewRelic::Agent.shutdown I could
> # try to get fd's closed properly between forks?

There's nothing inherent to Unicorn that prevents it from playing nicely
with threads.  It's just not playing nicely with threaded code written
without potential fork() calls in mind.

--
Eric Wong
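[One concrete way to do the fd inspection and socket cleanup Eric
describes is sketched below.  It assumes the agent's connection shows up
as an ordinary TCPSocket object and that closing it in the master is
safe; none of this comes from the thread itself.]

  # config/unicorn.rb -- debugging sketch, not a drop-in fix
  before_fork do |server, worker|
    # show which descriptors the master holds open right before forking
    server.logger.info `ls -l /proc/#{Process.pid}/fd`

    # close any plain client TCP sockets a background thread (e.g. the
    # NewRelic agent) may have opened in the master
    ObjectSpace.each_object(TCPSocket) do |sock|
      next if sock.is_a?(TCPServer)   # don't touch unicorn's listeners
      sock.close unless sock.closed?
    end
  end

  after_fork do |server, worker|
    # compare against the worker's view of its descriptors after the fork
    server.logger.info `ls -l /proc/#{Process.pid}/fd`
  end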
On Apr 28, 2010, at 12:05 AM, Jamie Wilkinson wrote:
> Update from the trenches: I've traced this down to the newrelic_rpm agent.
>
> I've also filed a bug with NewRelic:
> http://support.newrelic.com/discussions/support/2577-newrelic-agentbundler-causing-stillborn-unicorn-processes?unresolve=true

I'm happy to announce that the recently released 2.12.2 newrelic_rpm gem
fixed this long-running issue for me (newrelic_rpm causing some unicorns
to fail to start):

  http://rubygems.org/gems/newrelic_rpm/versions/2.12.2

Huge thanks to Bill Kayser from NewRelic & Eric Wong for their patient
debugging!

-jamie
http://jamiedubs.com