Hello there -
In my capacity as a support engineer at Engine Yard, I've recently been
working with a customer who's having some trouble with Unicorn.
Basically, they're finding their application being force-killed once it
hits the default timeout. Rather than simply increasing the timeout,
we're trying to find out _why_ their application is being killed.
Unfortunately we can't quite do that, because the application is
force-killed without warning, which means the customer's logging never
actually gets written to disk. (This is an intermittent issue as
opposed to something that happens 100% of the time.)
In discussing the matter internally and with our customer, I came up
with a simple monkey patch to Unicorn that _sort of_ works, but I'm
having some trouble with it once the number of Unicorn workers goes
beyond one. I originally restricted my testing to a single worker
because I wanted to reduce the chance that multiple workers would
complicate figuring things out -- turns out I was right.
I'm going to show the patch in two ways: 1) inline, at the bottom of
this post, and 2) by link to GitHub:
https://github.com/jaustinhughey/unicorn/blob/abort_prior_to_kill_workers/lib/unicorn/http_server.rb#L438
The general idea is that I'd like to have some way to "warn" the
application when it's about to be killed. I've patched
murder_lazy_workers to send an ABRT signal via kill_worker, sleep for 5
seconds, then check whether the process still exists using
Process.getpgid. If it does, it sends the original KILL; if not, a
rescue block catches the Errno::ESRCH raised by Process.getpgid so that
nothing explodes.
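Boiled down, the patched section does something like the following (a
simplified, standalone sketch of the idea rather than the literal diff --
the real thing is at the bottom of this post; the helper name is just
for illustration, and I'm using Process.kill here in place of Unicorn's
kill_worker so the sketch stands on its own):

    # Sketch: warn the worker with ABRT, give it a grace period, then
    # hard-kill it only if it's still around. wpid is the worker's PID.
    def abort_then_kill(wpid, grace = 5)
      Process.kill(:ABRT, wpid)  # ask the worker to wind down / flush logs
      sleep grace                # give it a few seconds to react
      Process.getpgid(wpid)      # raises Errno::ESRCH if the process is gone
      Process.kill(:KILL, wpid)  # still alive, so take no prisoners
    rescue Errno::ESRCH
      # The worker went away on its own after the ABRT; nothing to kill.
    end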
I've created a simulation app, built on Rails 3.0, that uses a generic
"posts" controller to simulate a long-running request. Instead of just
throwing a straight-up sleep 65 in there, I have it sleep 1 second at a
time on a decrementing counter, 65 times over. The reason is that,
assuming I've read the code correctly, even with my "skip sleeping
workers" line commented out below, a single long sleep would cause the
process to be skipped, rendering my simulation of a long-running
request invalid. However, clarification on this point is certainly
welcome. You can see the app here:
https://github.com/jaustinhughey/unicorn_test/blob/master/app/controllers/posts_controller.rb
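The loop in that index action is along these lines (paraphrasing from
the log output below rather than pasting the exact code -- see the link
above for the real thing; the Post.all at the end is just a stand-in
for whatever the rest of the action does):

    # app/controllers/posts_controller.rb -- rough sketch of the index action
    class PostsController < ApplicationController
      def index
        remaining = 65
        while remaining > 0
          Rails.logger.info "Sleeping 1 second (#{remaining} to go)..."
          sleep 1
          remaining -= 1
        end
        @posts = Post.all  # ordinary listing once the "work" is done
      end
    end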
The problem I'm running into, and where I could use some help, is when
I increase the number of Unicorn workers from one to two. When running
only one Unicorn worker, I can hit my application's posts_controller
index action, which has the intentionally long-running code, and
tail -f unicorn.log and production.log while it runs. Those two logs
look like this with one Unicorn worker:
WITH ONE UNICORN WORKER:
=======================
production.log:
---------------
Sleeping 1 second (65 to go)...
... continued ...
Sleeping 1 second (7 to go)...
Sleeping 1 second (6 to go)...
Sleeping 1 second (5 to go)...
Caught Unicorn kill exception!
Sleeping 1 second (4 to go)...
Sleeping 1 second (3 to go)...
Sleeping 1 second (2 to go)...
Sleeping 1 second (1 to go)...
Completed 500 Internal Server Error in 65131ms
NoMethodError (undefined method `query_options' for nil:NilClass):
app/controllers/posts_controller.rb:32:in `index'
(I think the NoMethodError above is due to my calling a disconnect on
ActiveRecord in the Signal.trap block, so I think it can safely be
ignored.)
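For context, the trap handler I'm referring to is roughly this shape
(again paraphrasing -- the exact code, including the exact disconnect
call, is in the test app linked above):

    # Log that Unicorn warned us, then drop the database connection
    # before the worker goes away.
    Signal.trap("ABRT") do
      Rails.logger.info "Caught Unicorn kill exception!"
      ActiveRecord::Base.connection.disconnect!
    end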
As you can see, the Signal.trap block inside the aforementioned
posts_controller is working in this case. The corresponding entries in
unicorn.log concur:
unicorn.log:
------------
worker=0 ready
master process ready
[2011-09-08 00:31:01 PDT] worker=0 PID: 28921 timeout hit, sending ABRT to
process 28921 then sleeping 5 seconds...
[2011-09-08 00:31:06 PDT] worker=0 PID:28921 timeout (61s > 60s), killing
reaped #<Process::Status: pid 28921 SIGKILL (signal 9)> worker=0
worker=0 ready
So with one worker, everything seems cool. But with two workers?
WITH TWO UNICORN WORKERS:
========================
production.log:
---------------
Sleeping 1 second (8 to go)...
Sleeping 1 second (7 to go)...
Sleeping 1 second (6 to go)...
Sleeping 1 second (5 to go)...
Sleeping 1 second (4 to go)...
Sleeping 1 second (3 to go)...
Sleeping 1 second (2 to go)...
Sleeping 1 second (1 to go)...
Rendered posts/index.html.erb within layouts/application (13.2ms)
Completed 200 OK in 65311ms (Views: 16.9ms | ActiveRecord: 0.5ms)
Note that there is no notice that the ABRT signal was trapped, nor is there a
NoMethodError (likely caused by disconnecting from the database) as above. Odd.
unicorn.log:
------------
Nothing. No new data whatsoever.
The only potential clue I can see at this point would be a start-up message in
unicorn.log. After increasing the number of Unicorn workers to two, I examined
unicorn.log again and found this:
master complete
I, [2011-09-08T00:34:40.499437 #29572] INFO -- : unlinking existing
socket=/var/run/engineyard/unicorn_ut.sock
I, [2011-09-08T00:34:40.499888 #29572] INFO -- : listening on
addr=/var/run/engineyard/unicorn_ut.sock fd=5
I, [2011-09-08T00:34:40.504542 #29572] INFO -- : Refreshing Gem list
worker=0 ready
master process ready
[2011-09-08 00:34:49 PDT] worker=1 PID: 29582 timeout hit, sending ABRT to
process 29582 then sleeping 5 seconds...
[2011-09-08 00:34:50 PDT] worker=1 PID:29582 timeout (1315467289s > 60s),
killing
reaped #<Process::Status: pid 29582 SIGIOT (signal 6)> worker=1
worker=1 ready
So it looks like worker 1 is hitting a strange/false timeout of
1315467289 seconds, which isn't really possible, as it wasn't even
running 1315467289 seconds prior to that (which equates to roughly 41
years ago if my math is right).
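(Quick sanity check on that math:

    1315467289 / (60 * 60 * 24 * 365.25)  # => ~41.7 years

so "roughly 41 years" holds.)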
---
Needless to say, I'm a bit stumped at this point, and I would sincerely
appreciate another point of view. Am I going about this all wrong? Is
there a better approach I should consider? And if I'm on the right
track, how can I get this to work regardless of how many Unicorn
workers are running?
Thank you very much for any assistance you can provide!
-- INLINE VERSION OF PATCH --
diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index 78d80b4..8a2323f 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -429,6 +429,11 @@ class Unicorn::HttpServer
     proc_name 'master (old)'
   end
 
+  # A custom formatted timestamp for debugging
+  def custom_timestamp
+    return Time.now.strftime("[%Y-%m-%d %T %Z]")
+  end
+
   # forcibly terminate all workers that haven't checked in in timeout
   # seconds. The timeout is implemented using an unlinked File
   def murder_lazy_workers
     t = @timeout
@@ -436,16 +441,40 @@ class Unicorn::HttpServer
     now = Time.now.to_i
     WORKERS.dup.each_pair do |wpid, worker|
       tick = worker.tick
-      0 == tick and next # skip workers that are sleeping
+
+      # REMOVE THE FOLLOWING COMMENT WHEN TESTING PRODUCTION
+#      0 == tick and next # skip workers that are sleeping
+      # ^ needs to be active, commented here for simulation purposes
+
       diff = now - tick
       tmp = t - diff
       if tmp >= 0
         next_sleep < tmp and next_sleep = tmp
         next
       end
-      logger.error "worker=#{worker.nr} PID:#{wpid} timeout " \
-                   "(#{diff}s > #{t}s), killing"
-      kill_worker(:KILL, wpid) # take no prisoners for timeout violations
+
+      # Send an ABRT signal to Unicorn and wait 5 seconds before attempting an
+      # actual kill, if and only if the process is still running.
+      begin
+        # Send the ABRT signal.
+        logger.debug "#{custom_timestamp} worker=#{worker.nr} PID: #{wpid} timeout hit, sending ABRT to process #{wpid} then sleeping 5 seconds..."
+        kill_worker(:ABRT, wpid)
+
+        sleep 5
+
+        # Now see if the process still exists after being given five
+        # seconds to terminate on its own, and if so, do a hard kill.
+        if Process.getpgid(wpid)
+          logger.error "#{custom_timestamp} worker=#{worker.nr} PID:#{wpid} timeout " \
+                       "(#{diff}s > #{@timeout}s), killing"
+          kill_worker(:KILL, wpid) # take no prisoners for timeout violations
+        end
+      rescue Errno::ESRCH => e
+        # No process identified - maybe it exited on its own?
+        logger.debug "#{custom_timestamp} worker=#{worker.nr} PID: #{wpid} responded to ABRT on its own and no longer exists. (Received message: #{e})"
+      end
     end
     next_sleep
   end
-- END INLINE VERSION OF PATCH --
--
J. Austin Hughey
Application Support Engineer - Engine Yard
www.engineyard.com | jhughey at engineyard.com