Cedric Maion
2012-Jun-23 16:12 UTC
`kill -SIGTRAP <worker pid>` to get a live ruby backtrace + generate backtrace when murdering worker due to timeout
Hi,

The following patch allows dumping live Ruby backtraces of running workers by sending a TRAP signal (kill -5) to the worker PID. The master also automatically generates a backtrace when it kills a worker due to timeout: this helps identify what the worker was doing and hopefully gives a hint about what was taking so much time.

Kind regards,

Cedric

PS: not subscribed to the ML, so please CC: me when replying. Thanks!
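For illustration, requesting a backtrace from every worker of a running master could look roughly like this (an untested sketch; the pid file path is an assumption, adjust to your deployment):

    # hypothetical helper: send SIGTRAP (signal 5) to each worker
    # forked by the unicorn master named in the pid file
    master_pid  = File.read("/path/to/unicorn.pid").to_i
    worker_pids = `pgrep -P #{master_pid}`.split.map { |s| s.to_i }
    worker_pids.each { |pid| Process.kill(:TRAP, pid) }

Each worker would then write its backtrace to the log.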
Cedric Maion
2012-Jun-23 16:12 UTC
[PATCH] `kill -SIGTRAP <worker pid>` to get a live ruby backtrace + generate backtrace when murdering worker due to timeout
---
 lib/unicorn/http_server.rb | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index 14a6f9a..8507fe4 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -457,6 +457,8 @@ class Unicorn::HttpServer
       next_sleep = 0
       logger.error "worker=#{worker.nr} PID:#{wpid} timeout " \
                    "(#{diff}s > #{@timeout}s), killing"
+      kill_worker(:TRAP, wpid)
+      sleep(0.5)
       kill_worker(:KILL, wpid) # take no prisoners for timeout violations
     end
     next_sleep <= 0 ? 1 : next_sleep
@@ -594,6 +596,7 @@ class Unicorn::HttpServer
     # closing anything we IO.select on will raise EBADF
     trap(:USR1) { nr = -65536; SELF_PIPE[0].close rescue nil }
     trap(:QUIT) { worker = nil; LISTENERS.each { |s| s.close rescue nil }.clear }
+    trap(:TRAP) { logger.info("worker=#{worker.nr} pid:#{$$} received TRAP signal, showing backtrace:\n#{caller.join("\n")}") }
     logger.info "worker=#{worker.nr} ready"

     begin
--
1.7.9.5
Subject: [PATCH] `kill -SIGTRAP <worker pid>` to get a live ruby backtrace + generate backtrace when murdering worker due to timeout

Please keep Subject lines a reasonable length (git recommends the commit
message subject wrap at ~50 columns or so) and wrap code at <= 80 columns.

Cedric Maion <cedric at maion.com> wrote:
> ---
>  lib/unicorn/http_server.rb | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
> index 14a6f9a..8507fe4 100644
> --- a/lib/unicorn/http_server.rb
> +++ b/lib/unicorn/http_server.rb
> @@ -457,6 +457,8 @@ class Unicorn::HttpServer
>        next_sleep = 0
>        logger.error "worker=#{worker.nr} PID:#{wpid} timeout " \
>                     "(#{diff}s > #{@timeout}s), killing"
> +      kill_worker(:TRAP, wpid)
> +      sleep(0.5)
>        kill_worker(:KILL, wpid) # take no prisoners for timeout violations

SIGKILL timeout is only a last line of defense when the Ruby VM itself
is completely broken.  Handling SIGTRAP implies the worker can still
respond (and /can/ be rescued), so your SIGTRAP handler is worthless if
SIGKILL is required to kill a process.

See http://unicorn.bogomips.org/Application_Timeouts.html

Sleeping here is also unacceptable since it blocks the main loop,
making the master's signal handlers non-responsive for too long.

> @@ -594,6 +596,7 @@ class Unicorn::HttpServer
>      # closing anything we IO.select on will raise EBADF
>      trap(:USR1) { nr = -65536; SELF_PIPE[0].close rescue nil }
>      trap(:QUIT) { worker = nil; LISTENERS.each { |s| s.close rescue nil }.clear }
> +    trap(:TRAP) { logger.info("worker=#{worker.nr} pid:#{$$} received TRAP signal, showing backtrace:\n#{caller.join("\n")}") }
>      logger.info "worker=#{worker.nr} ready"

Using the Logger class inside a signal handler can deadlock.  Logger
attempts to acquire a non-reentrant lock when called.  Unicorn doesn't
use threads itself, but the Rack app may use threads internally.

Thanks for your interest in unicorn!
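For reference, a handler along these lines (an untested sketch, not part of the patch) would avoid Logger inside the signal handler by writing the backtrace straight to stderr; `worker` is the local variable already in scope where the patch installs the trap:

    # untested sketch: dump the backtrace without calling Logger;
    # a single $stderr.write avoids Logger's non-reentrant lock
    trap(:TRAP) do
      $stderr.write("worker=#{worker.nr} pid:#{$$} SIGTRAP backtrace:\n" \
                    "#{caller.join("\n")}\n")
    end

This still isn't strictly async-signal-safe, but it sidesteps the Logger lock mentioned above.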
Hi,

> Please keep Subject lines a reasonable length (git recommends the commit
> message subject wrap at ~50 columns or so) and wrap code at <= 80
> columns

ok

> SIGKILL timeout is only a last line of defense when the Ruby VM itself
> is completely broken.  Handling SIGTRAP implies the worker can still
> respond (and /can/ be rescued), so your SIGTRAP handler is worthless if
> SIGKILL is required to kill a process.

Sure. But if the VM is responding, being able to get a backtrace is
nice. And if it's stuck, you won't get anything indeed, but that's still
information (in that case, one may eventually want to get a gdb
backtrace too). No?

> See http://unicorn.bogomips.org/Application_Timeouts.html

Yes, I'm well aware of this. However, when you still get rare unicorn
timeouts, debugging them is not obvious.
In my case, a server in a load-balanced farm sometimes sees all its
unicorn workers time out in the same minute (approx. once a day, at what
seems to be a random time) -- other servers are fine. I couldn't
correlate this with any specific network/disk/misc system/user activity
yet.

> Sleeping here is also unacceptable since it blocks the main loop,
> making the master's signal handlers non-responsive for too long.

ok.

> Using the Logger class inside a signal handler can deadlock.  Logger
> attempts to acquire a non-reentrant lock when called.  Unicorn doesn't
> use threads itself, but the Rack app may use threads internally.

ok, can be replaced with a $stdout.write then.

> Thanks for your interest in unicorn!

Thanks for your feedback,

Kind regards,

Cedric
Cedric Maion <cedric at maion.com> wrote:
> Eric Wong <normalperson at yhbt.net> wrote:
> > SIGKILL timeout is only a last line of defense when the Ruby VM itself
> > is completely broken.  Handling SIGTRAP implies the worker can still
> > respond (and /can/ be rescued), so your SIGTRAP handler is worthless if
> > SIGKILL is required to kill a process.
> Sure. But if the VM is responding, being able to get a backtrace is
> nice. And if it's stuck, you won't get anything indeed, but that's
> still information (in that case, one may eventually want to get a gdb
> backtrace too). No?

Sure it's nice.  But the point is you should've had something around to
handle it in your app anyways if your worker was capable of responding
to SIGTRAP at all.  The SIGKILL logic only exists in the master because
it must run outside of the worker.

> > See http://unicorn.bogomips.org/Application_Timeouts.html
> Yes, I'm well aware of this. However, when you still get rare unicorn
> timeouts, debugging them is not obvious.
> In my case, a server in a load-balanced farm sometimes sees all its
> unicorn workers time out in the same minute (approx. once a day, at
> what seems to be a random time) -- other servers are fine. I couldn't
> correlate this with any specific network/disk/misc system/user
> activity yet.

I might even crank the unicorn timeout sky-high and have something else
(per-worker) handling timeouts + debugging/dumping in this case.

I recall some mailing list threads on similar topics over the years;
gmane has excellent archives and I'd start there (and not the RubyForge
archives):

  gmane.org/gmane.comp.lang.ruby.unicorn.general

Rainbows::ThreadTimeout could be used as a starting point for a Rack
middleware to debug with:

  git clone git://bogomips.org/rainbows
  cat lib/rainbows/thread_timeout.rb
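For reference, such a per-worker approach could look roughly like the following Rack middleware (an untested sketch loosely modeled on the idea behind Rainbows::ThreadTimeout; the BacktraceDump name and the 55s threshold are made up): a watchdog thread dumps the request thread's backtrace if a request runs longer than expected, while the unicorn timeout stays much higher as the last resort. Requires Ruby 1.9+ for Thread#backtrace.

    # untested sketch: log a backtrace for slow requests from inside the worker
    class BacktraceDump
      def initialize(app, threshold = 55) # seconds; pick below the unicorn timeout
        @app, @threshold = app, threshold
      end

      def call(env)
        request_thread = Thread.current
        watchdog = Thread.new do
          sleep(@threshold)
          $stderr.write("pid:#{$$} slow request #{env['REQUEST_METHOD']} " \
                        "#{env['PATH_INFO']}:\n" \
                        "#{request_thread.backtrace.join("\n")}\n")
        end
        @app.call(env)
      ensure
        watchdog.kill if watchdog # request finished (or failed) in time
      end
    end

    # config.ru:
    #   use BacktraceDump, 55

Unlike the SIGTRAP approach, this runs entirely inside the worker, so it can capture a backtrace even when the master later has to resort to SIGKILL.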