Cedric Maion
2012-Jun-23 16:12 UTC
`kill -SIGTRAP <worker pid>` to get a live ruby backtrace + generate backtrace when murdering worker due to timeout
Hi,

The following patch allows dumping live Ruby backtraces of running workers by sending a TRAP signal (kill -5) to the worker PID. The master also automatically generates a backtrace when it kills a worker due to timeout: this helps identify what the worker was doing and hopefully gives a hint about what was taking so much time.

Kind regards,

Cedric

PS: not subscribed to the ML, so please CC: me when replying. Thanks!
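For illustration, requesting a backtrace from every worker of a running master could look roughly like this (an untested sketch; the pid file path is an assumption, adjust to your deployment):

    # hypothetical helper: send SIGTRAP (signal 5) to each worker
    # forked by the unicorn master named in the pid file
    master_pid  = File.read("/path/to/unicorn.pid").to_i
    worker_pids = `pgrep -P #{master_pid}`.split.map { |s| s.to_i }
    worker_pids.each { |pid| Process.kill(:TRAP, pid) }

Each worker would then write its backtrace to the log.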
Cedric Maion
2012-Jun-23 16:12 UTC
[PATCH] `kill -SIGTRAP <worker pid>` to get a live ruby backtrace + generate backtrace when murdering worker due to timeout
---
 lib/unicorn/http_server.rb | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index 14a6f9a..8507fe4 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -457,6 +457,8 @@ class Unicorn::HttpServer
       next_sleep = 0
       logger.error "worker=#{worker.nr} PID:#{wpid} timeout " \
                    "(#{diff}s > #{@timeout}s), killing"
+      kill_worker(:TRAP, wpid)
+      sleep(0.5)
       kill_worker(:KILL, wpid) # take no prisoners for timeout violations
     end
     next_sleep <= 0 ? 1 : next_sleep
@@ -594,6 +596,7 @@ class Unicorn::HttpServer
     # closing anything we IO.select on will raise EBADF
     trap(:USR1) { nr = -65536; SELF_PIPE[0].close rescue nil }
     trap(:QUIT) { worker = nil; LISTENERS.each { |s| s.close rescue nil }.clear }
+    trap(:TRAP) { logger.info("worker=#{worker.nr} pid:#{$$} received TRAP signal, showing backtrace:\n#{caller.join("\n")}") }
     logger.info "worker=#{worker.nr} ready"

     begin
--
1.7.9.5
Subject: [PATCH] `kill -SIGTRAP <worker pid>` to get a live ruby backtrace + generate backtrace when murdering worker due to timeout

Please keep Subject lines a reasonable length (git recommends the commit
message subject wrap at ~50 columns or so) and wrap code at <= 80 columns.

Cedric Maion <cedric at maion.com> wrote:
> ---
>  lib/unicorn/http_server.rb | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
> index 14a6f9a..8507fe4 100644
> --- a/lib/unicorn/http_server.rb
> +++ b/lib/unicorn/http_server.rb
> @@ -457,6 +457,8 @@ class Unicorn::HttpServer
>        next_sleep = 0
>        logger.error "worker=#{worker.nr} PID:#{wpid} timeout " \
>                     "(#{diff}s > #{@timeout}s), killing"
> +      kill_worker(:TRAP, wpid)
> +      sleep(0.5)
>        kill_worker(:KILL, wpid) # take no prisoners for timeout violations

SIGKILL timeout is only a last line of defense when the Ruby VM itself
is completely broken.  Handling SIGTRAP implies the worker can still
respond (and /can/ be rescued), so your SIGTRAP handler is worthless if
SIGKILL is required to kill a process.

See http://unicorn.bogomips.org/Application_Timeouts.html

Sleeping here is also unacceptable since it blocks the main loop,
making the master's signal handlers non-responsive for too long.

> @@ -594,6 +596,7 @@ class Unicorn::HttpServer
>      # closing anything we IO.select on will raise EBADF
>      trap(:USR1) { nr = -65536; SELF_PIPE[0].close rescue nil }
>      trap(:QUIT) { worker = nil; LISTENERS.each { |s| s.close rescue nil }.clear }
> +    trap(:TRAP) { logger.info("worker=#{worker.nr} pid:#{$$} received TRAP signal, showing backtrace:\n#{caller.join("\n")}") }
>      logger.info "worker=#{worker.nr} ready"

Using the Logger class inside a signal handler can deadlock.  Logger
attempts to acquire a non-reentrant lock when called.  Unicorn doesn't
use threads itself, but the Rack app may use threads internally.

Thanks for your interest in unicorn!
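For reference, a handler along these lines (an untested sketch, not part of the patch) would avoid Logger inside the signal handler by writing the backtrace straight to stderr; `worker` is the local variable already in scope where the patch installs the trap:

    # untested sketch: dump the backtrace without calling Logger;
    # a single $stderr.write avoids Logger's non-reentrant lock
    trap(:TRAP) do
      $stderr.write("worker=#{worker.nr} pid:#{$$} SIGTRAP backtrace:\n" \
                    "#{caller.join("\n")}\n")
    end

This still isn't strictly async-signal-safe, but it sidesteps the Logger lock mentioned above.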
Hi,

> Please keep Subject lines a reasonable length (git recommends the commit
> message subject wrap at ~50 columns or so) and wrap code at <= 80
> columns

ok

> SIGKILL timeout is only a last line of defense when the Ruby VM itself
> is completely broken.  Handling SIGTRAP implies the worker can still
> respond (and /can/ be rescued), so your SIGTRAP handler is worthless if
> SIGKILL is required to kill a process.

Sure. But if the VM is responding, being able to get a backtrace is
nice. And if it's stuck, you won't get anything indeed, but that's still
information (in that case, one may eventually want to get a gdb
backtrace too). No?

> See http://unicorn.bogomips.org/Application_Timeouts.html

Yes, I'm well aware of this. However, when you still get rare unicorn
timeouts, debugging them is not obvious.
In my case, a server in a load-balanced farm sometimes sees all its
unicorn workers time out in the same minute (approx. once a day, at what
seems to be a random time) -- other servers are fine. I couldn't
correlate this with any specific network/disk/misc system/user activity
yet.

> Sleeping here is also unacceptable since it blocks the main loop,
> making the master's signal handlers non-responsive for too long.

ok.

> Using the Logger class inside a signal handler can deadlock.  Logger
> attempts to acquire a non-reentrant lock when called.  Unicorn doesn't
> use threads itself, but the Rack app may use threads internally.

ok, can be replaced with a $stdout.write then.

> Thanks for your interest in unicorn!

Thanks for your feedback,

Kind regards,

Cedric
Cedric Maion <cedric at maion.com> wrote:
> Eric Wong <normalperson at yhbt.net> wrote:
> > SIGKILL timeout is only a last line of defense when the Ruby VM itself
> > is completely broken.  Handling SIGTRAP implies the worker can still
> > respond (and /can/ be rescued), so your SIGTRAP handler is worthless if
> > SIGKILL is required to kill a process.
> Sure. But if the VM is responding, being able to get a backtrace is
> nice. And if it's stuck, you won't get anything indeed, but that's
> still information (in that case, one may eventually want to get a gdb
> backtrace too). No?

Sure it's nice.  But the point is you should've had something around to
handle it in your app anyways if your worker was capable of responding
to SIGTRAP at all.  The SIGKILL logic only exists in the master because
it must run outside of the worker.

> > See http://unicorn.bogomips.org/Application_Timeouts.html
> Yes, I'm well aware of this. However, when you still get rare unicorn
> timeouts, debugging them is not obvious.
> In my case, a server in a load-balanced farm sometimes sees all its
> unicorn workers time out in the same minute (approx. once a day, at
> what seems to be a random time) -- other servers are fine. I couldn't
> correlate this with any specific network/disk/misc system/user
> activity yet.

I might even crank the unicorn timeout sky-high and have something else
(per-worker) handling timeouts + debugging/dumping in this case.

I recall some mailing list threads on similar topics over the years;
gmane has excellent archives and I'd start there (and not the RubyForge
archives):

  gmane.org/gmane.comp.lang.ruby.unicorn.general

Rainbows::ThreadTimeout could be used as a starting point for a Rack
middleware to debug with:

  git clone git://bogomips.org/rainbows
  cat lib/rainbows/thread_timeout.rb
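For reference, such a per-worker approach could look roughly like the following Rack middleware (an untested sketch loosely modeled on the idea behind Rainbows::ThreadTimeout; the BacktraceDump name and the 55s threshold are made up): a watchdog thread dumps the request thread's backtrace if a request runs longer than expected, while the unicorn timeout stays much higher as the last resort. Requires Ruby 1.9+ for Thread#backtrace.

    # untested sketch: log a backtrace for slow requests from inside the worker
    class BacktraceDump
      def initialize(app, threshold = 55) # seconds; pick below the unicorn timeout
        @app, @threshold = app, threshold
      end

      def call(env)
        request_thread = Thread.current
        watchdog = Thread.new do
          sleep(@threshold)
          $stderr.write("pid:#{$$} slow request #{env['REQUEST_METHOD']} " \
                        "#{env['PATH_INFO']}:\n" \
                        "#{request_thread.backtrace.join("\n")}\n")
        end
        @app.call(env)
      ensure
        watchdog.kill if watchdog # request finished (or failed) in time
      end
    end

    # config.ru:
    #   use BacktraceDump, 55

Unlike the SIGTRAP approach, this runs entirely inside the worker, so it can capture a backtrace even when the master later has to resort to SIGKILL.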