Hello there -
In my capacity as a support engineer at Engine Yard, I've recently been
working with a customer who's having some trouble with Unicorn.
Basically, they're finding their application being force-killed once it
hits the default timeout. Rather than simply increasing the timeout,
we're trying to find out _why_ their application is being killed.
Unfortunately we can't quite do that, because the application is
force-killed without warning, which means the customer's logging never
actually gets written to disk. (This is an intermittent issue as
opposed to something that happens 100% of the time.)
In discussing the matter internally and with our customer, I came up
with a simple monkey patch to Unicorn that _sort of_ works, but I'm
having some trouble with it once the number of Unicorn workers goes
beyond one. I originally restricted my testing to a single worker
because I wanted to reduce the chance that multiple workers would
complicate figuring things out -- turns out I was right.
I'm going to show the patch in two ways: 1) inline, at the bottom of
this post, and 2) by link to GitHub:
https://github.com/jaustinhughey/unicorn/blob/abort_prior_to_kill_workers/lib/unicorn/http_server.rb#L438
The general idea is that I'd like to have some way to "warn" the
application when it's about to be killed. I've patched
murder_lazy_workers to send an ABRT signal via kill_worker, sleep for 5
seconds, then check whether the process still exists using
Process.getpgid. If it does, it sends the original KILL; if not, a
rescue block catches the Errno::ESRCH raised by Process.getpgid so that
nothing explodes.
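Boiled down, the patched section does something like the following (a
simplified, standalone sketch of the idea rather than the literal diff --
the real thing is at the bottom of this post; the helper name is just
for illustration, and I'm using Process.kill here in place of Unicorn's
kill_worker so the sketch stands on its own):

    # Sketch: warn the worker with ABRT, give it a grace period, then
    # hard-kill it only if it's still around. wpid is the worker's PID.
    def abort_then_kill(wpid, grace = 5)
      Process.kill(:ABRT, wpid)  # ask the worker to wind down / flush logs
      sleep grace                # give it a few seconds to react
      Process.getpgid(wpid)      # raises Errno::ESRCH if the process is gone
      Process.kill(:KILL, wpid)  # still alive, so take no prisoners
    rescue Errno::ESRCH
      # The worker went away on its own after the ABRT; nothing to kill.
    end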
I've created a simulation app, built on Rails 3.0, that uses a generic
"posts" controller to simulate a long-running request. Instead of just
throwing a straight-up sleep 65 in there, I have it sleep 1 second at a
time on a decrementing counter, 65 times over. The reason is that,
assuming I've read the code correctly, even with my "skip sleeping
workers" line commented out below, a single long sleep would cause the
process to be skipped, rendering my simulation of a long-running
request invalid. However, clarification on this point is certainly
welcome. You can see the app here:
https://github.com/jaustinhughey/unicorn_test/blob/master/app/controllers/posts_controller.rb
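The loop in that index action is along these lines (paraphrasing from
the log output below rather than pasting the exact code -- see the link
above for the real thing; the Post.all at the end is just a stand-in
for whatever the rest of the action does):

    # app/controllers/posts_controller.rb -- rough sketch of the index action
    class PostsController < ApplicationController
      def index
        remaining = 65
        while remaining > 0
          Rails.logger.info "Sleeping 1 second (#{remaining} to go)..."
          sleep 1
          remaining -= 1
        end
        @posts = Post.all  # ordinary listing once the "work" is done
      end
    end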
The problem I'm running into, and where I could use some help, is when
I increase the number of Unicorn workers from one to two. When running
only one Unicorn worker, I can hit my application's posts_controller
index action, which has the intentionally long-running code, and
tail -f unicorn.log and production.log while it runs. Those two logs
look like this with one Unicorn worker:
WITH ONE UNICORN WORKER:
=======================
production.log:
---------------
Sleeping 1 second (65 to go)...
... continued ...
Sleeping 1 second (7 to go)...
Sleeping 1 second (6 to go)...
Sleeping 1 second (5 to go)...
Caught Unicorn kill exception!
Sleeping 1 second (4 to go)...
Sleeping 1 second (3 to go)...
Sleeping 1 second (2 to go)...
Sleeping 1 second (1 to go)...
Completed 500 Internal Server Error in 65131ms
NoMethodError (undefined method `query_options' for nil:NilClass):
app/controllers/posts_controller.rb:32:in `index'
(I think the NoMethodError above is due to my calling a disconnect on
ActiveRecord in the Signal.trap block, so I think it can safely be
ignored.)
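For context, the trap handler I'm referring to is roughly this shape
(again paraphrasing -- the exact code, including the exact disconnect
call, is in the test app linked above):

    # Log that Unicorn warned us, then drop the database connection
    # before the worker goes away.
    Signal.trap("ABRT") do
      Rails.logger.info "Caught Unicorn kill exception!"
      ActiveRecord::Base.connection.disconnect!
    end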
As you can see, the Signal.trap block inside the aforementioned
posts_controller is working in this case. The corresponding entries in
unicorn.log concur:
unicorn.log:
------------
worker=0 ready
master process ready
[2011-09-08 00:31:01 PDT] worker=0 PID: 28921 timeout hit, sending ABRT to
process 28921 then sleeping 5 seconds...
[2011-09-08 00:31:06 PDT] worker=0 PID:28921 timeout (61s > 60s), killing
reaped #<Process::Status: pid 28921 SIGKILL (signal 9)> worker=0
worker=0 ready
So with one worker, everything seems cool. But with two workers?
WITH TWO UNICORN WORKERS:
========================
production.log:
---------------
Sleeping 1 second (8 to go)...
Sleeping 1 second (7 to go)...
Sleeping 1 second (6 to go)...
Sleeping 1 second (5 to go)...
Sleeping 1 second (4 to go)...
Sleeping 1 second (3 to go)...
Sleeping 1 second (2 to go)...
Sleeping 1 second (1 to go)...
Rendered posts/index.html.erb within layouts/application (13.2ms)
Completed 200 OK in 65311ms (Views: 16.9ms | ActiveRecord: 0.5ms)
Note that there is no notice that the ABRT signal was trapped, nor is there a
NoMethodError (likely caused by disconnecting from the database) as above. Odd.
unicorn.log:
------------
Nothing. No new data whatsoever.
The only potential clue I can see at this point would be a start-up message in
unicorn.log. After increasing the number of Unicorn workers to two, I examined
unicorn.log again and found this:
master complete
I, [2011-09-08T00:34:40.499437 #29572] INFO -- : unlinking existing
socket=/var/run/engineyard/unicorn_ut.sock
I, [2011-09-08T00:34:40.499888 #29572] INFO -- : listening on
addr=/var/run/engineyard/unicorn_ut.sock fd=5
I, [2011-09-08T00:34:40.504542 #29572] INFO -- : Refreshing Gem list
worker=0 ready
master process ready
[2011-09-08 00:34:49 PDT] worker=1 PID: 29582 timeout hit, sending ABRT to
process 29582 then sleeping 5 seconds...
[2011-09-08 00:34:50 PDT] worker=1 PID:29582 timeout (1315467289s > 60s),
killing
reaped #<Process::Status: pid 29582 SIGIOT (signal 6)> worker=1
worker=1 ready
So it looks like worker 1 is hitting a strange/false timeout of
1315467289 seconds, which isn't really possible, as it wasn't even
running 1315467289 seconds prior to that (which equates to roughly 41
years ago if my math is right).
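(Quick sanity check on that math:

    1315467289 / (60 * 60 * 24 * 365.25)  # => ~41.7 years

so "roughly 41 years" holds.)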
---
Needless to say, I'm a bit stumped at this point, and I would sincerely
appreciate another point of view. Am I going about this all wrong? Is
there a better approach I should consider? And if I'm on the right
track, how can I get this to work regardless of how many Unicorn
workers are running?
Thank you very much for any assistance you can provide!
-- INLINE VERSION OF PATCH --
diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index 78d80b4..8a2323f 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -429,6 +429,11 @@ class Unicorn::HttpServer
     proc_name 'master (old)'
   end
 
+  # A custom formatted timestamp for debugging
+  def custom_timestamp
+    return Time.now.strftime("[%Y-%m-%d %T %Z]")
+  end
+
   # forcibly terminate all workers that haven't checked in in timeout
   # seconds. The timeout is implemented using an unlinked File
   def murder_lazy_workers
     t = @timeout
@@ -436,16 +441,40 @@ class Unicorn::HttpServer
     now = Time.now.to_i
     WORKERS.dup.each_pair do |wpid, worker|
       tick = worker.tick
-      0 == tick and next # skip workers that are sleeping
+
+      # REMOVE THE FOLLOWING COMMENT WHEN TESTING PRODUCTION
+#      0 == tick and next # skip workers that are sleeping
+      # ^ needs to be active, commented here for simulation purposes
+
       diff = now - tick
       tmp = t - diff
       if tmp >= 0
         next_sleep < tmp and next_sleep = tmp
         next
       end
-      logger.error "worker=#{worker.nr} PID:#{wpid} timeout " \
-                   "(#{diff}s > #{t}s), killing"
-      kill_worker(:KILL, wpid) # take no prisoners for timeout violations
+
+      # Send an ABRT signal to Unicorn and wait 5 seconds before attempting an
+      # actual kill, if and only if the process is still running.
+      begin
+        # Send the ABRT signal.
+        logger.debug "#{custom_timestamp} worker=#{worker.nr} PID: #{wpid} timeout hit, sending ABRT to process #{wpid} then sleeping 5 seconds..."
+        kill_worker(:ABRT, wpid)
+
+        sleep 5
+
+        # Now see if the process still exists after being given five
+        # seconds to terminate on its own, and if so, do a hard kill.
+        if Process.getpgid(wpid)
+          logger.error "#{custom_timestamp} worker=#{worker.nr} PID:#{wpid} timeout " \
+                       "(#{diff}s > #{@timeout}s), killing"
+          kill_worker(:KILL, wpid) # take no prisoners for timeout violations
+        end
+      rescue Errno::ESRCH => e
+        # No process identified - maybe it exited on its own?
+        logger.debug "#{custom_timestamp} worker=#{worker.nr} PID: #{wpid} responded to ABRT on its own and no longer exists. (Received message: #{e})"
+      end
     end
     next_sleep
   end
-- END INLINE VERSION OF PATCH --
--
J. Austin Hughey
Application Support Engineer - Engine Yard
www.engineyard.com | jhughey at engineyard.com