Hey,

So here are some tasks for managing unicorn: https://gist.github.com/1121076

I've found that it's very unreliable for quitting / terminating unicorn
and restarting with new code. When doing rapid deployments particularly,
I've found that I have to go in and kill -9 the master process, and
start again. Any thoughts on why it seems ineffective?

Thanks.

PS: here's the unicorn config: https://gist.github.com/0dd07c5ad00c56d161c7

James
> I've found that it's very unreliable for quitting / terminating
> unicorn and restarting with new code. When doing rapid deployments
> particularly, I've found that I have to go in and kill -9 the master
> process, and start again.

We've noticed some of this behavior as well. We've seen the new master
start spinning and consume 100% CPU (according to top). The old master
and workers are still running and working, but the new master just
hangs, and we have to kill -9.

Our solution was to add the following to our unicorn config, which
*seems* to have solved the problem:

  Unicorn::HttpServer::START_CTX[0] =
    "#{path_to_app}/shared/bundle/ruby/1.9.1/bin/unicorn_rails"

This is outlined a bit here: http://unicorn.bogomips.org/Sandbox.html

I'm not really sure why this seems to have fixed the problem -- all I
can tell you is that we haven't seen it since adding this line to the
config.

- Alex
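For context, a sketch of where that line sits in a unicorn config file;
every path and setting below is hypothetical, not from Alex's setup:

  # config/unicorn.rb -- all paths here are illustrative
  shared = "/var/www/myapp/shared"

  # Pin the binary re-executed on USR2 to the bundled binstub, as
  # described at http://unicorn.bogomips.org/Sandbox.html:
  Unicorn::HttpServer::START_CTX[0] =
    "#{shared}/bundle/ruby/1.9.1/bin/unicorn_rails"

  worker_processes 4
  listen "#{shared}/sockets/unicorn.sock"
  pid "#{shared}/pids/unicorn.pid"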
Your config has:

  sig = (worker.nr + 1) >= server.worker_processes ? :TERM : :TTOU
  Process.kill(sig, File.read(old_pid).to_i)

I don't have such a conditional in mine, and I use a QUIT signal
(although I don't see why TERM isn't working for you):

  Process.kill("QUIT", File.read(old_pid).to_i)

You might also want to try logging a

  puts File.read(old_pid)

to the unicorn.stdout.log to ensure the old_pid file is being correctly
written to (and in the right path).
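That logging suggestion, sketched inside a before_fork hook -- the pid
path is hypothetical:

  before_fork do |server, worker|
    old_pid = "/var/www/app/shared/pids/unicorn.pid.oldbin"  # illustrative path
    # puts goes to unicorn.stdout.log when stdout_path points there,
    # so this shows whether the old pid file exists and what it holds:
    if File.exist?(old_pid)
      puts "old_pid=#{old_pid} contents=#{File.read(old_pid)}"
    else
      puts "old_pid file missing: #{old_pid}"
    end
  end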
James Cox <james at imaj.es> wrote:
> Hey,
>
> So here are some tasks for managing unicorn:
>
> https://gist.github.com/1121076

Can we ignore the :restart task?  It's a bit fragile since it doesn't
wait for the old process to die (SIGTERM means
kill-as-quickly-as-possible, but given a loaded system it can still
take some time).

> I've found that it's very unreliable for quitting / terminating
> unicorn and restarting with new code. When doing rapid deployments
> particularly, I've found that I have to go in and kill -9 the master
> process, and start again.

If you SIGQUIT/SIGTERM before the app is loaded, the signal could be
ignored.  This behavior should probably change...

> any thoughts on why it seems ineffective?
>
> Thanks.

Which version of Unicorn are you using?  I recall some minor tweaks to
avoid/minimize race conditions over the years so maybe some are fixed.

> PS: here's the unicorn config:
>
> https://gist.github.com/0dd07c5ad00c56d161c7

Avoid the top piece of the before_fork hook to TTOU workers, it's
needlessly complex for most deployments unless you're really
memory-constrained.

--
Eric Wong
Alex Sharp <ajsharp at gmail.com> wrote:
> > I've found that it's very unreliable for quitting / terminating
> > unicorn and restarting with new code. When doing rapid deployments
> > particularly, i've found that i have to go in and kill -9 the master
> > process, and start again.
>
> We've noticed some of this behavior as well. We've seen the new master
> start spinning and consume 100% cpu (according to top). The old master
> and workers are still running and working, but the new master just
> hangs, and we have to kill -9.

Can you try to strace (or equivalent) the old master to see what's
using 100% CPU?

--
Eric Wong
On Tue, Aug 2, 2011 at 17:53, Eric Wong <normalperson at yhbt.net> wrote:
> James Cox <james at imaj.es> wrote:
>> Hey,
>>
>> So here are some tasks for managing unicorn:
>>
>> https://gist.github.com/1121076
>
> Can we ignore the :restart task?  It's a bit fragile since it doesn't
> wait for the old process to die (SIGTERM means
> kill-as-quickly-as-possible, but given a loaded system it can still
> take some time).

We mostly restart (what a surprise). What pattern works best here for
that? (Speed of deploy is important, but definitely assume a long boot
time.)

>> I've found that it's very unreliable for quitting / terminating
>> unicorn and restarting with new code. When doing rapid deployments
>> particularly, I've found that I have to go in and kill -9 the master
>> process, and start again.
>
> If you SIGQUIT/SIGTERM before the app is loaded, the signal could
> be ignored.  This behavior should probably change...
>
>> any thoughts on why it seems ineffective?
>>
>> Thanks.
>
> Which version of Unicorn are you using?  I recall some minor tweaks
> to avoid/minimize race conditions over the years so maybe some
> are fixed.

gem 'unicorn' - so whatever seems up to date. My lock says 4.0.1.

>> PS: here's the unicorn config:
>>
>> https://gist.github.com/0dd07c5ad00c56d161c7
>
> Avoid the top piece of the before_fork hook to TTOU workers, it's
> needlessly complex for most deployments unless you're really
> memory-constrained.

So what should that look like? All but that nr-workers stuff? Can I
just remove the before_fork? And what would you say is a super good
unicorn config to start from?

thanks!
james
James Cox <james at imaj.es> wrote:
> On Tue, Aug 2, 2011 at 17:53, Eric Wong <normalperson at yhbt.net> wrote:
> > Can we ignore the :restart task?  It's a bit fragile since it doesn't
> > wait for the old process to die (SIGTERM means
> > kill-as-quickly-as-possible, but given a loaded system it can still
> > take some time).
>
> We mostly restart (what a surprise). What pattern works best here for
> that? (Speed of deploy is important, but definitely assume a long boot
> time.)

Yeah, if you do a full shutdown, I would definitely wait a little for
the old process to shut down (wait for the pid to disappear) first.

However, a full shutdown means you drop connections, so I would use
USR2 followed by QUIT (on the same pid; USR2 will replace the current
pid file).  See http://unicorn.bogomips.org/SIGNALS.html

> So what should that look like? All but that nr-workers stuff? Can I
> just remove the before_fork? And what would you say is a super good
> unicorn config to start from?

Yeah, skip the before_fork and also after_fork.  Those are mainly for
disconnect/reconnect of ActiveRecord and anything else that might need
a network connection.  I try to have as little as possible in my
unicorn config.

--
Eric Wong
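A sketch of that USR2-then-QUIT sequence as a deploy step; the pid path
and polling are illustrative, and the error handling is simplified:

  pid_file = "/var/www/app/shared/pids/unicorn.pid"  # hypothetical path
  old_pid  = Integer(File.read(pid_file))

  Process.kill(:USR2, old_pid)  # old pid file moves to unicorn.pid.oldbin

  # Wait until the new master has written its own pid file:
  sleep 1 until File.exist?(pid_file) &&
                Integer(File.read(pid_file)) != old_pid

  Process.kill(:QUIT, old_pid)  # gracefully stop the old master + workers

  # Optionally wait for the old master to actually exit:
  begin
    loop do
      Process.kill(0, old_pid)  # raises Errno::ESRCH once it is gone
      sleep 1
    end
  rescue Errno::ESRCH
    # old master has exited; deploy is done
  end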
On Tue, Aug 2, 2011 at 4:08 PM, Eric Wong <normalperson at yhbt.net> wrote:
> James Cox <james at imaj.es> wrote:
>> So what should that look like? All but that nr-workers stuff? Can I
>> just remove the before_fork? And what would you say is a super good
>> unicorn config to start from?
>
> Yeah, skip the before_fork and also after_fork.  Those are mainly for
> disconnect/reconnect of ActiveRecord and anything else that might need a
> network connection.

FWIW, we use the before_fork hook to automatically kill the old master
by sending it QUIT:

  before_fork do |server, worker|
    old_pid = '/var/www/api/shared/pids/unicorn.pid.oldbin'
    if File.exists?(old_pid) && server.pid != old_pid
      begin
        Process.kill("QUIT", File.read(old_pid).to_i)
      rescue Errno::ENOENT, Errno::ESRCH
        # another newly forked worker has already killed it
      end
    end
  end

- alex
Eric Wong <normalperson at yhbt.net> wrote:
> If you SIGQUIT/SIGTERM before the app is loaded, the signal could
> be ignored. This behavior should probably change...

I pushed the following to git://bogomips.org/unicorn

From 406b8b0e2ed6e5be34d8ec3cd4b16048233c2856 Mon Sep 17 00:00:00 2001
From: Eric Wong <normalperson at yhbt.net>
Date: Tue, 2 Aug 2011 23:52:14 +0000
Subject: [PATCH] trap death signals in the worker sooner

This helps close a race condition preventing shutdown if
loading the application (preload_app=false) takes a long
time and the user decides to kill workers instead.
---
 lib/unicorn/http_server.rb |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index ad5d6f0..565f132 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -535,12 +535,17 @@ class Unicorn::HttpServer
     handle_error(client, e)
   end

+  EXIT_SIGS = [ :QUIT, :TERM, :INT ]
+  WORKER_QUEUE_SIGS = QUEUE_SIGS - EXIT_SIGS
+
   # gets rid of stuff the worker has no business keeping track of
   # to free some resources and drops all sig handlers.
   # traps for USR1, USR2, and HUP may be set in the after_fork Proc
   # by the user.
   def init_worker_process(worker)
-    QUEUE_SIGS.each { |sig| trap(sig, nil) }
+    # we'll re-trap :QUIT later for graceful shutdown iff we accept clients
+    EXIT_SIGS.each { |sig| trap(sig) { exit!(0) } }
+    WORKER_QUEUE_SIGS.each { |sig| trap(sig, nil) }
     trap(:CHLD, 'DEFAULT')
     SIG_QUEUE.clear
     proc_name "worker[#{worker.nr}]"
@@ -578,7 +583,6 @@ class Unicorn::HttpServer
     # closing anything we IO.select on will raise EBADF
     trap(:USR1) { nr = -65536; SELF_PIPE[0].close rescue nil }
     trap(:QUIT) { worker = nil; LISTENERS.each { |s| s.close rescue nil }.clear }
-    [:TERM, :INT].each { |sig| trap(sig) { exit!(0) } } # instant shutdown
     logger.info "worker=#{worker.nr} ready"

     begin

--
Eric Wong
Eric Wong <normalperson at yhbt.net> wrote:
> Eric Wong <normalperson at yhbt.net> wrote:
> > If you SIGQUIT/SIGTERM before the app is loaded, the signal could
> > be ignored. This behavior should probably change...
>
> I pushed the following to git://bogomips.org/unicorn
>
> From 406b8b0e2ed6e5be34d8ec3cd4b16048233c2856 Mon Sep 17 00:00:00 2001

Also pushed to rubygems.org for easier testing:

  gem install --pre unicorn -v 4.0.1.4.g406b

This will probably be unicorn 4.0.2 shortly.
On Tue, Aug 2, 2011 at 2:54 PM, Eric Wong <normalperson at yhbt.net> wrote:
> Can you try to strace (or equivalent) the old master to see what's
> using 100% CPU?

All I see is a whole lot of this:

sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0

- alex
On Thu, Aug 4, 2011 at 9:09 PM, Alex Sharp <ajsharp at gmail.com> wrote:
> On Tue, Aug 2, 2011 at 2:54 PM, Eric Wong <normalperson at yhbt.net> wrote:
>> Can you try to strace (or equivalent) the old master to see what's
>> using 100% CPU?
>
> All I see is a whole lot of this:
>
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0

Actually, my fault -- the last email was the output of the new master.
Running strace on the old master shows the following:

select(4, [3], NULL, NULL, {13, 466686}) = 0 (Timeout)
wait4(-1, 0x7fff57e7bfcc, WNOHANG, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1312517345, 425411398}) = 0
fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
clock_gettime(CLOCK_REALTIME, {1312517345, 425625251}) = 0
fstat(11, {st_mode=S_IFREG, st_size=0, ...}) = 0
clock_gettime(CLOCK_REALTIME, {1312517345, 425779281}) = 0
fstat(12, {st_mode=S_IFREG, st_size=0, ...}) = 0
clock_gettime(CLOCK_REALTIME, {1312517345, 425927762}) = 0

The first line was when the master was idle, and then I threw a few
requests at it.

- alex
Alex Sharp <ajsharp at gmail.com> wrote:
> On Thu, Aug 4, 2011 at 9:09 PM, Alex Sharp <ajsharp at gmail.com> wrote:
> > On Tue, Aug 2, 2011 at 2:54 PM, Eric Wong <normalperson at yhbt.net> wrote:
> >> Can you try to strace (or equivalent) the old master to see what's
> >> using 100% CPU?
>
> Actually, my fault -- the last email was the output of the new master.
> Running strace on the old master shows the following:

Is this Unicorn 3.x?  Which 3.x version exactly?  Maybe give 4.0.1 or
the 4.0.2 git prerelease a try, too.

> select(4, [3], NULL, NULL, {13, 466686}) = 0 (Timeout)
> wait4(-1, 0x7fff57e7bfcc, WNOHANG, NULL) = 0
> clock_gettime(CLOCK_REALTIME, {1312517345, 425411398}) = 0
> fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
> clock_gettime(CLOCK_REALTIME, {1312517345, 425625251}) = 0
> fstat(11, {st_mode=S_IFREG, st_size=0, ...}) = 0
> clock_gettime(CLOCK_REALTIME, {1312517345, 425779281}) = 0
> fstat(12, {st_mode=S_IFREG, st_size=0, ...}) = 0
> clock_gettime(CLOCK_REALTIME, {1312517345, 425927762}) = 0

Can I get more? (until the next select() call, at least).  I'd like to
see the timeout argument that gets passed to the next select().

If you see select() with a very small timeout, use "strace -v" to get
the st_ctime from the fstat()s...

I could have a math bug in murder_lazy_workers (please read/review the
logic in that method, I haven't noticed issues myself).

I made some tweaks to the master sleep strategy within the past year to
reduce wakeups during idle hours.  This is intended to go with Ruby
1.9.3, which will have a /much/ better thread wakeup strategy that
reduces power consumption during idle times.

> The first line was when the master was idle, and then I threw a few
> requests at it.

Are all workers responding as expected and not dying?

--
Eric Wong
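To make the "math bug" worry concrete, here is the shape of the timeout
bookkeeping murder_lazy_workers performs -- this is not unicorn's actual
code; the method name, last_checkin structure, and :KILL choice are all
illustrative:

  # last_checkin maps worker pids to the epoch seconds of each worker's
  # last heartbeat; timeout is the configured worker timeout.
  def murder_lazy_workers_sketch(last_checkin, timeout)
    next_sleep = timeout
    now = Time.now.to_i
    last_checkin.each_pair do |pid, tick_at|
      wait = now - tick_at
      if wait >= timeout
        Process.kill(:KILL, pid)  # worker looks stuck; graceful :QUIT may be ignored
      else
        # a math bug here could hand the master a tiny select() timeout
        next_sleep = [timeout - wait, next_sleep].min
      end
    end
    next_sleep  # the master's select() sleeps at most this long
  end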
On Fri, Aug 5, 2011 at 1:07 AM, Eric Wong <normalperson at yhbt.net> wrote:
> Is this Unicorn 3.x?  Which 3.x version exactly?  Maybe give 4.0.1
> or the 4.0.2 git prerelease a try, too.

We're using version 3.6.2 with ruby 1.9.2-p180 on ubuntu 11.x. I know
the kernel on this version of ubuntu has a known signal trapping bug,
but I don't think that's what's happening here. The processes respond
after I send the restart signal to them again (USR2 + QUIT).

> Can I get more? (until the next select() call, at least).  I'd like to
> see the timeout argument that gets passed to the next select().

Here's some more output. This is from the old master though, not a new
master that is pegging the CPU. In this instance it's almost like the
old master just ignores the signal.

select(4, [3], NULL, NULL, {8, 129984}) = 0 (Timeout)
wait4(-1, 0x7fff16b82e4c, WNOHANG, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 259408364}) = 0
fstat(12, {st_dev=makedev(202, 1), st_ino=20373, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/12-23:14:20, st_mtime=2011/08/12-23:14:20, st_ctime=2011/08/17-04:22:21}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 259775504}) = 0
fstat(13, {st_dev=makedev(202, 1), st_ino=20381, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/12-23:14:20, st_mtime=2011/08/12-23:14:20, st_ctime=2011/08/17-04:22:17}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 259936816}) = 0
fstat(14, {st_dev=makedev(202, 1), st_ino=20382, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21, st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260086380}) = 0
fstat(15, {st_dev=makedev(202, 1), st_ino=20185, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21, st_ctime=2011/08/17-04:22:17}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260235797}) = 0
fstat(16, {st_dev=makedev(202, 1), st_ino=20255, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21, st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260384849}) = 0
fstat(17, {st_dev=makedev(202, 1), st_ino=20383, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21, st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260534792}) = 0
fstat(18, {st_dev=makedev(202, 1), st_ino=20384, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21, st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260684278}) = 0
fstat(10, {st_dev=makedev(202, 1), st_ino=20196, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/16-16:33:46, st_mtime=2011/08/16-16:33:46, st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260833725}) = 0
select(4, [3], NULL, NULL, {8, 976047}

> If you see select() with very small timeout, use "strace -v" to get
> the st_ctime from fstat()s...
>
> I could have a math bug in murder_lazy_workers (please read/review the
> logic in that method, I haven't noticed issues myself).
>
> I made some tweaks to the master sleep strategy within the past year to
> reduce wakeups during idle hours.  This is intended to go with Ruby
> 1.9.3 which will have /much/ better thread wakeup strategy that reduces
> power consumption during idle times.
>
>> The first line was when the master was idle, and then I threw a few
>> requests at it.
>
> Are all workers responding as expected and not dying?

The old workers appear to be serving requests.

- Alex
Alex Sharp <ajsharp at gmail.com> wrote:
> On Fri, Aug 5, 2011 at 1:07 AM, Eric Wong <normalperson at yhbt.net> wrote:
> > Can I get more? (until the next select() call, at least).  I'd like to
> > see the timeout argument that gets passed to the next select().
>
> Here's some more output. This is from the old master though, not a new
> master that is pegging the CPU. In this instance it's almost like the
> old master just ignores the signal.

Wait, weren't you trying to track down a problem with the master that's
pegging the CPU?  Isn't the CPU usage the problem here?

I was also thinking the CPU usage in the new master was caused by
constant worker respawning because bundler wasn't loaded correctly
somehow....

> select(4, [3], NULL, NULL, {8, 129984}) = 0 (Timeout)
<snip>
> select(4, [3], NULL, NULL, {8, 976047}

OK, so /this/ master is confirmed to be sleeping with a reasonable
timeout and not pegging the CPU...

If you want to actually track down whether or not a signal is being
delivered, use "strace -f -e '!futex'" since Ruby 1.9 has a dedicated
signal handling thread (at the OS level).  (You only need the
-e '!futex' part to filter out the noise from the signal handling
thread in <= 1.9.2; 1.9.3 is much quieter :)

Below is a proposed patch (to unicorn.git) which may help debug issues
in the signal -> handler master path (but only once it enters the Ruby
runtime).  I'm hesitant to commit it since it's worthless if the Ruby
process is stuck because of some bad C extension.  That's the most
common cause of stuck/unresponsive processes I've seen.

Subject: [PATCH] http_server: add debug statements for master sig handlers

This should help users track down what's going on as soon as Ruby can
process the signal.  It's still not going to be useful if the Ruby
process is stuck because of a bug in a C extension or Ruby itself,
though.

Unicorn always defers signals in the master process to serialize signal
handling since some of the important actions (e.g. HUP, USR1, USR2) are
NOT reentrant.
---
 lib/unicorn/http_server.rb |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index aa8212e..b45d6d6 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -126,8 +126,17 @@ class Unicorn::HttpServer
     # setup signal handlers before writing pid file in case people get
     # trigger happy and send signals as soon as the pid file exists.
     # Note that signals don't actually get handled until the #join method
-    QUEUE_SIGS.each { |sig| trap(sig) { SIG_QUEUE << sig; awaken_master } }
-    trap(:CHLD) { awaken_master }
+    QUEUE_SIGS.each do |sig|
+      trap(sig) do
+        @logger.debug("received SIG#{sig}")
+        SIG_QUEUE << sig
+        awaken_master
+      end
+    end
+    trap(:CHLD) do
+      @logger.debug("received SIGCHLD")
+      awaken_master
+    end

     self.pid = config[:pid]
     self.master_pid = $$

--
Eric Wong
Eric Wong <normalperson at yhbt.net> wrote:
> Below is a proposed patch (to unicorn.git) which may help debug issues
> in the signal -> handler master path (but only once it enters the Ruby
> runtime).  I'm hesitant to commit it since it's worthless if the Ruby
> process is stuck because of some bad C extension.  That's the most
> common cause of stuck/unresponsive processes I've seen.

I think that was a bad patch; adding signal handler debugging at the
Ruby layer leads to the false assumption that the interpreter/VM is in
a good state.  If you need to debug signal handlers, something is
already broken and tracing syscalls is the most reliable way to go.

Ruby (and any other high-level language) signal handling is not
straightforward [1].  Here's how things work in Matz Ruby 1.9.x [2]:

  you          C timer thread             Ruby Thread(s)
  -------------------------------------------------------------------
               traps signals              ignores most signals
               sleeps                     runs Ruby...
  kill -USR2   ...
               receives signal (async)
               runs (system) sighandler[1]
               wakes up from sleep
               signals Ruby Thread(s)
                                          *hopefully wakes up*
                                          runs Ruby sighandler

The "*hopefully wakes up*" part is the part most likely to fail as a
result of a bad C extension or Ruby bug.

PS. In Ruby 1.9.3, the timer thread uses the "self-pipe" sighandler
implementation that the unicorn master process always used.  This
allows Ruby 1.9.3 to conserve power on idle processes.  In 1.9.2, the
timer thread signal handler just polls in 10ms intervals to check if
any signals were received.  This is why "strace -f" is noisy and I
recommend "-e '!futex'" for 1.9.2.

PPS. Unicorn still uses the "self-pipe" signal handler in Ruby-land
because Ruby signal handlers are reentrant and so must execute
reentrant-safe code.  Without the self-pipe to serialize signal handler
dispatch, the Ruby signal handler execution can nest and overlap
execution with itself.  This means if USR2 is sent multiple times in
short succession, you could spawn multiple new unicorn masters.

[1] - See "man 7 signal" in Linux manpages or POSIX specs for the small
      list of safe functions that may be called in system-level
      sighandlers.  Ruby-level signal handlers can't run inside
      system-level signal handlers for this reason.

[2] - I think any high-level language that implements signal handlers
      AND native threads must do something similar.  The only valid
      variation I can think of is to execute the high-level language
      code inside the timer thread, but that requires the coders of the
      high-level language to have thread-safety (not just reentrancy)
      in mind when writing signal handlers even if the rest of their
      code uses no threads.

--
Eric Wong
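To illustrate the self-pipe trick mentioned above, here is a minimal
standalone sketch -- not unicorn's actual code; the signal names and
the 10-second timeout are arbitrary:

  SELF_PIPE = IO.pipe   # [reader, writer]
  SIG_QUEUE = []

  %w(QUIT USR1 USR2).each do |sig|
    trap(sig) do
      SIG_QUEUE << sig  # reentrant-safe: only record that the signal arrived
      begin
        SELF_PIPE[1].write_nonblock(".")  # wake the sleeping main loop
      rescue Errno::EAGAIN, Errno::EINTR
      end
    end
  end

  loop do
    # the main loop sleeps here; an incoming signal makes the pipe readable
    if IO.select([SELF_PIPE[0]], nil, nil, 10)
      SELF_PIPE[0].read_nonblock(16) rescue nil  # drain the wakeup bytes
    end
    while sig = SIG_QUEUE.shift
      puts "handling SIG#{sig} serially, outside the trap"  # real work here
    end
  end

Because the queue is drained one entry at a time in the main loop, the
handlers never nest, which is exactly the serialization property the
PPS describes.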
Ok, thanks a lot for the patch attempt and subsequent in-depth
explanation. I know this thread is long-lived and a bit confusing,
mostly because I've seen two different scenarios that yield the same
result, which is that unicorn does not restart.

The first scenario (and the one that started this thread) was that
unicorn actually receives the signal, spawns a new master, and that
process pegs the cpu and spins into eternity. The second scenario is
that the old master seems to completely ignore the USR2 signal the
first time, and when sent again, properly restarts. In both scenarios
the old master and workers continue to serve the old code.

I thought that setting Unicorn::HttpServer::START_CTX[0] in my unicorn
config had solved the first scenario (pegged cpu), but I found a new
occurrence of it today, and I now have some new strace output for this
scenario (strace -v):

select(4, [3], NULL, NULL, {60, 727229}) = 0 (Timeout)
wait4(-1, 0x7ffffd70f72c, WNOHANG, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1313708081, 720757662}) = 0
fstat(8, {st_dev=makedev(202, 1), st_ino=39806, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10, st_ctime=2011/08/18-22:54:06}) = 0
clock_gettime(CLOCK_REALTIME, {1313708081, 721131305}) = 0
fstat(10, {st_dev=makedev(202, 1), st_ino=45370, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10, st_ctime=2011/08/18-22:54:05}) = 0
clock_gettime(CLOCK_REALTIME, {1313708081, 721290800}) = 0
select(4, [3], NULL, NULL, {565, 34005}) = 0 (Timeout)
wait4(-1, 0x7ffffd70f72c, WNOHANG, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1313708646, 853870540}) = 0
fstat(8, {st_dev=makedev(202, 1), st_ino=39806, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10, st_ctime=2011/08/18-22:59:06}) = 0
clock_gettime(CLOCK_REALTIME, {1313708646, 854102750}) = 0
fstat(10, {st_dev=makedev(202, 1), st_ino=45370, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10, st_ctime=2011/08/18-23:04:05}) = 0
clock_gettime(CLOCK_REALTIME, {1313708646, 854260655}) = 0
select(4, [3], NULL, NULL, {598, 630876}

With respect to the second scenario (ignoring signals), I'm going to
recommend we table that issue for now, as we're currently running on a
version of ubuntu (11.10) which has a known signal trapping bug with
ruby 1.9.2-p180, and downgrading to 10.04 may solve that problem.
Granted, when I've observed this in the past with other libraries, the
process becomes completely unresponsive, whereas the behavior in
unicorn is more intermittent. Either way, we are in the process of
downgrading our servers to ubuntu 10.04, so let's not waste anymore
time trying to debug something that may well be occurring due to a
kernel bug. If we continue to see this after the downgrade to 10.04,
I'll report back and we can keep digging.

Again, my apologies for the confusion between the two scenarios, and
thanks for all your help.

- Alex
Alex Sharp <ajsharp at gmail.com> wrote:
> Ok, thanks a lot for the patch attempt and subsequent in-depth
> explanation. I know this thread is long-lived and a bit confusing,
> mostly because I've seen two different scenarios that yield the same
> result, which is that unicorn does not restart.

No problem, I'm here to help and mutt makes it easy to follow long
threads :>

<snip>

> I thought that setting Unicorn::HttpServer::START_CTX[0] in my unicorn
> config had solved the first scenario (pegged cpu), but I found a new
> occurrence of it today, and I now have some new strace output for this
> scenario (strace -v):
>
> select(4, [3], NULL, NULL, {60, 727229}) = 0 (Timeout)
> wait4(-1, 0x7ffffd70f72c, WNOHANG, NULL) = 0
> clock_gettime(CLOCK_REALTIME, {1313708081, 720757662}) = 0
> fstat(8, {st_dev=makedev(202, 1), st_ino=39806, st_mode=S_IFREG,
> st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
> st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10,
> st_ctime=2011/08/18-22:54:06}) = 0
> clock_gettime(CLOCK_REALTIME, {1313708081, 721131305}) = 0
> fstat(10, {st_dev=makedev(202, 1), st_ino=45370, st_mode=S_IFREG,
> st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
> st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10,
> st_ctime=2011/08/18-22:54:05}) = 0
> clock_gettime(CLOCK_REALTIME, {1313708081, 721290800}) = 0
> select(4, [3], NULL, NULL, {565, 34005}) = 0 (Timeout)

<snip>

OK, this looks good from the system call level in the main thread of
the master process: I can see that from select() timing out despite
being called with long intervals.  I don't see workers dying, either...

Can you add "-f -e '!futex'" to the strace invocation so we see all the
threads?  You can also add '-T' to get timing information to see how
long each syscall takes, or '-tt' to get timestamps in strace, if you
think it's useful.

Since the main thread doesn't appear to be doing much, it's probably a
child thread somewhere...

Also, I assume you're using preload_app?  If you are, do you see this
issue with preload_app=false?  I suspect there's some background thread
that could be running in the master process taking up all the CPU time.
Unicorn itself never spawns background threads.

Basically anything you can tell us about the app, the configuration,
and which gems/libraries would be useful.

> With respect to the second scenario (ignoring signals), I'm going to
> recommend we table that issue for now, as we're currently running on a
> version of ubuntu (11.10) which has a known signal trapping bug with
> ruby 1.9.2-p180, and downgrading to 10.04 may solve that problem.
> Granted, when I've observed this in the past with other libraries, the
> process becomes completely unresponsive, whereas the behavior in
> unicorn is more intermittent. Either way, we are in the process of
> downgrading our servers to ubuntu 10.04, so let's not waste anymore
> time trying to debug something that may well be occurring due to a
> kernel bug. If we continue to see this after the downgrade to 10.04,
> I'll report back and we can keep digging.

OK.  I wouldn't rule out the CPU usage as unrelated to the signaling
issue, though.  The Ruby 1.9 timer thread could be going crazy because
it can't signal or receive signals properly...

--
Eric Wong
On Thu, Aug 18, 2011 at 6:53 PM, Eric Wong <normalperson at yhbt.net> wrote:
> I can see that from select() timing out despite being called with
> long intervals.  I don't see workers dying, either...

That's correct, the old workers are staying alive as well.

> Can you add "-f -e '!futex'" to the strace invocation so we see
> all the threads?  You can also add '-T' to get timing information
> to see how long each syscall takes or '-tt' to get timestamps
> in strace if you think it's useful.
>
> Also, I assume you're using preload_app?  If you are, do you see this
> issue with preload_app=false?  I suspect there's some background thread
> that could be running in the master process taking up all the CPU time.
> Unicorn itself never spawns background threads.

Yes, we're using preload_app=true. I haven't tried it with
preload_app=false -- if I get to it tomorrow I'll report back here.

We *are* using newrelic, which operates in a background thread. I've
just found this ticket in the newrelic support forums:
https://support.newrelic.com/help/discussions/support/7692-newrelic_rpm-gem-with-unicorn-40

I'll re-run strace with the suggested flags above and report back.

> Basically anything you can tell us about the app, the configuration,
> and which gems/libraries would be useful.

Gemfile: https://gist.github.com/05a9445471ad7edfdcb7

> OK.  I wouldn't rule out the CPU usage as unrelated to the signaling
> issue, though.  The Ruby 1.9 timer thread could be going crazy
> because it can't signal or receive signals properly...

--
alex sharp
github.com/ajsharp
twitter.com/ajsharp
alexjsharp.com
Alex Sharp <ajsharp at gmail.com> wrote:
> On Thu, Aug 18, 2011 at 6:53 PM, Eric Wong <normalperson at yhbt.net> wrote:
> > Basically anything you can tell us about the app, the configuration,
> > and which gems/libraries would be useful.
>
> Gemfile: https://gist.github.com/05a9445471ad7edfdcb7

OK, out of the few gems I recognize, I don't see anything else besides
newrelic_rpm that might run a background thread.  I would scan through
the rest of them to be sure (and maybe other folks here can give you
pointers).

Personally, I never depend on anything before I've had a chance to scan
through the code and know what it does, first.

--
Eric Wong
Eric Wong <normalperson at yhbt.net> wrote:
> Eric Wong <normalperson at yhbt.net> wrote:
> > If you SIGQUIT/SIGTERM before the app is loaded, the signal could
> > be ignored. This behavior should probably change...
>
> I pushed the following to git://bogomips.org/unicorn
>
> From 406b8b0e2ed6e5be34d8ec3cd4b16048233c2856 Mon Sep 17 00:00:00 2001
> From: Eric Wong <normalperson at yhbt.net>
> Date: Tue, 2 Aug 2011 23:52:14 +0000
> Subject: [PATCH] trap death signals in the worker sooner
>
> This helps close a race condition preventing shutdown if
> loading the application (preload_app=false) takes a long
> time and the user decides to kill workers instead.

The following completely eliminates the race condition:

From 8de6ab371c1623669b86a5dfa8703c8fd539011f Mon Sep 17 00:00:00 2001
From: Eric Wong <normalperson at yhbt.net>
Date: Fri, 19 Aug 2011 22:13:04 +0000
Subject: [PATCH] close race if an exit signal hits the worker before trap

The signal handler from the master is still active and will push the
pending signal to SIG_QUEUE if a worker receives a signal immediately
after forking.
---
 lib/unicorn/http_server.rb |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index aa8212e..4f516c9 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -549,6 +549,7 @@ class Unicorn::HttpServer
   def init_worker_process(worker)
     # we'll re-trap :QUIT later for graceful shutdown iff we accept clients
     EXIT_SIGS.each { |sig| trap(sig) { exit!(0) } }
+    exit!(0) if (SIG_QUEUE & EXIT_SIGS)[0]
     WORKER_QUEUE_SIGS.each { |sig| trap(sig, nil) }
     trap(:CHLD, 'DEFAULT')
     SIG_QUEUE.clear

--
Eric Wong
>> Can you add "-f -e '!futex'" to the strace invocation so we see
>> all the threads?  You can also add '-T' to get timing information
>> to see how long each syscall takes or '-tt' to get timestamps
>> in strace if you think it's useful.

Here's the result of strace with the suggested flags on the old master.
(To clarify, this is still the spinning CPU issue.)

$ sudo strace -v -f -e '!futex' -p 18862
Process 18862 attached with 2 threads - interrupt to quit
[pid 22170] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 18862] select(4, [3], NULL, NULL, {3, 546106}) = 0 (Timeout)
[pid 18862] wait4(-1, 0x7fffbb907d3c, WNOHANG, NULL) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 459472514}) = 0
[pid 18862] fstat(7, {st_dev=makedev(202, 1), st_ino=31695, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:06}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 459813102}) = 0
[pid 18862] fstat(11, {st_dev=makedev(202, 1), st_ino=31696, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:07}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 459999675}) = 0
[pid 18862] fstat(13, {st_dev=makedev(202, 1), st_ino=31697, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:07}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 460168995}) = 0
[pid 18862] fstat(14, {st_dev=makedev(202, 1), st_ino=31698, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:07}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 460343593}) = 0
[pid 18862] select(4, [3], NULL, NULL, {6, 255959}) = 0 (Timeout)
[pid 18862] wait4(-1, 0x7fffbb907d3c, WNOHANG, NULL) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 739352608}) = 0
[pid 18862] fstat(7, {st_dev=makedev(202, 1), st_ino=31695, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:16}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 739570659}) = 0
[pid 18862] fstat(11, {st_dev=makedev(202, 1), st_ino=31696, st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:17}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 739768755}) = 0
[pid 18862] fstat(13, {st_dev=makedev(202, 1), st_ino=31697, st_mode=S_IFREG|01, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:17}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 739951254}) = 0
[pid 18862] fstat(14, {st_dev=makedev(202, 1), st_ino=31698, st_mode=S_IFREG|01, st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05, st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:17}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 740130965}) = 0
[pid 18862] select(4, [3], NULL, NULL, {9, 996190}^C <unfinished ...>

>> Also, I assume you're using preload_app?  If you are, do you see this
>> issue with preload_app=false?  I suspect there's some background thread
>> that could be running in the master process taking up all the CPU time.
>> Unicorn itself never spawns background threads.

I went ahead and ran strace with the same flags on the *new* master,
and saw a bunch of output that looked bundler-related:
https://gist.github.com/138344b5b19ec6ba1a4c

Even more bizarre, eventually the process started successfully :-/ Is
it possible this had something to do with strace detaching?

You can see this in the unicorn.stderr.log file I included in the gist.
Check out these two lines in particular, which occur 25 minutes apart:

I, [2011-08-23T02:15:08.396868 #22169] INFO -- : Refreshing Gem list
I, [2011-08-23T02:40:16.621210 #22925] INFO -- : worker=1 spawned pid=22925

Question: If I turn on debug flags in unicorn, will I get much of this
output in the unicorn.stderr.log?

- Alex
Alex Sharp <ajsharp at gmail.com> wrote:
> >> Can you add "-f -e '!futex'" to the strace invocation so we see
> >> all the threads?  You can also add '-T' to get timing information
> >> to see how long each syscall takes or '-tt' to get timestamps
> >> in strace if you think it's useful.
>
> Here's the result of strace with the suggested flags on the old master.
> (To clarify, this is still the spinning CPU issue.)
>
> $ sudo strace -v -f -e '!futex' -p 18862
> Process 18862 attached with 2 threads - interrupt to quit
> [pid 22170] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
> [pid 18862] select(4, [3], NULL, NULL, {3, 546106}) = 0 (Timeout)

<snip>

22170 is presumably the timer thread.

Did you send any signals to 18862 while you were making this strace?

If you did send a signal to 18862 while stracing, it's worrying it
didn't show up in your strace.  If you didn't send a signal, then try
sending a signal.

> >> Also, I assume you're using preload_app?  If you are, do you see this
> >> issue with preload_app=false?  I suspect there's some background thread
> >> that could be running in the master process taking up all the CPU time.
> >> Unicorn itself never spawns background threads.
>
> I went ahead and ran strace with the same flags on the *new* master,
> and saw a bunch of output that looked bundler-related:
> https://gist.github.com/138344b5b19ec6ba1a4c
>
> Even more bizarre, eventually the process started successfully :-/ Is
> it possible this had something to do with strace detaching?

That looks like pretty normal "require" behavior.  strace would slow
down requires, a lot.

So this was with preload_app=true?  While you're debugging problems,
I suggest keeping preload_app=false and worker_problems=1 to minimize
the variables.

> You can see this in the unicorn.stderr.log file I included in the
> gist. Check out these two lines in particular, which occur 25 minutes
> apart:
>
> I, [2011-08-23T02:15:08.396868 #22169] INFO -- : Refreshing Gem list
> I, [2011-08-23T02:40:16.621210 #22925] INFO -- : worker=1 spawned pid=22925

Wow, it takes 25 minutes to load your application?  strace makes the
application /much/ slower, so I can actually believe it takes that long.

Loading large applications is very slow under Ruby 1.9.2; there are
some pathological load performance issues fixed in 1.9.3.

So you're saying /without/ strace, CPU usage _stays_ at 100% and
_never_ goes down?

> Question: If I turn on debug flags in unicorn, will I get much of this
> output in the unicorn.stderr.log?

You'll get every exception that's thrown even if it's rescued.
Unicorn 3.x itself doesn't throw that many exceptions.

Generally, can you reproduce this behavior on a plain (empty) Rails
application with no extra gems?

--
Eric Wong
On Tuesday, August 23, 2011 at 12:12 AM, Eric Wong wrote:
> Did you send any signals to 18862 while you were making this strace?

Not while, no. I had already sent a USR2 signal to this process (the
old master), and then I attached strace to it. I'll try sending another
USR2 signal next time while strace is attached.

> > I went ahead and ran strace with the same flags on the *new* master,
> > and saw a bunch of output that looked bundler-related:
> > https://gist.github.com/138344b5b19ec6ba1a4c
> >
> > Even more bizarre, eventually the process started successfully :-/ Is
> > it possible this had something to do with strace detaching?
>
> That looks like pretty normal "require" behavior. strace would slow
> down requires, a lot.
>
> So this was with preload_app=true? While you're debugging problems,
> I suggest keeping preload_app=false and worker_problems=1 to minimize
> the variables.

Ok, I'll change those and report back. I'm guessing you meant
worker_processes (not problems)?

> > You can see this in the unicorn.stderr.log file I included in the
> > gist. Check out these two lines in particular, which occur 25 minutes
> > apart:
> >
> > I, [2011-08-23T02:15:08.396868 #22169] INFO -- : Refreshing Gem list
> > I, [2011-08-23T02:40:16.621210 #22925] INFO -- : worker=1 spawned pid=22925
>
> Wow, it takes 25 minutes to load your application? strace makes the
> application /much/ slower, so I can actually believe it takes that long.

No, my mistake. Loading the application only takes about 10 seconds,
and I only had strace attached to this process for a few seconds (less
than 10). My point here was to show that the new master just spun for a
good 25 minutes (presumably trying to load files over and over again),
and then, seemingly out of nowhere, the new master came up and spawned
the new workers. Next time I'll try to get attached with strace earlier
and record more output.

> Loading large applications is very slow under Ruby 1.9.2; there are
> some pathological load performance issues fixed in 1.9.3.

Yep, I've read about those, and I've seen Xavier's patch, but I don't
think that's the issue here (though, it appears that's why the files
attempting to be loaded in the strace output do not exist). Under
normal circumstances, loading the app takes about 10 seconds and
doesn't peg the cpu while doing it.

> So you're saying /without/ strace, CPU usage _stays_ at 100% and _never_
> goes down?

Correct.

> Generally, can you reproduce this behavior on a plain (empty) Rails
> application with no extra gems?

Good idea, I'll try that next.

- Alex
Alex Sharp <ajsharp at gmail.com> wrote:
> On Tuesday, August 23, 2011 at 12:12 AM, Eric Wong wrote:
> > So this was with preload_app=true?  While you're debugging problems,
> > I suggest keeping preload_app=false and worker_problems=1 to minimize
> > the variables.
>
> Ok, I'll change those and report back. I'm guessing you meant
> worker_processes (not problems)?

LOL yes, it was late last night :)

> > Generally, can you reproduce this behavior on a plain (empty) Rails
> > application with no extra gems?
>
> Good idea, I'll try that next.

I rarely run into problems, but doing things like this (starting with
an empty app and Unicorn as default-as-possible: worker_processes=1,
preload_app=false) should be standard procedure for troubleshooting.
These things are so second nature to me that I forget to mention them :<

I'm also the type who reads the code of /every/ library I introduce
into the app, so I know (or am very quickly capable of knowing) exactly
what an app is doing.

--
Eric Wong
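A minimal troubleshooting config along those lines might look like the
following; the listen port and log paths are illustrative:

  # config/unicorn-debug.rb -- everything else left at unicorn's defaults
  worker_processes 1
  preload_app false
  listen 8080
  pid "/tmp/unicorn-debug.pid"
  stderr_path "/tmp/unicorn-debug.stderr.log"
  stdout_path "/tmp/unicorn-debug.stdout.log"

With only one worker and no preloading, any remaining misbehavior is
much easier to attribute to the app or a gem rather than to fork/reload
interactions.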
Eric Wong <normalperson at yhbt.net> wrote:
> +      trap(sig) do
> +        @logger.debug("received SIG#{sig}")

I'm even more glad I didn't apply this patch to Unicorn.  I completely
forgot Logger uses a mutex internally (even though it doesn't need to
when writing to a POSIX-compliant file system).

Rainbows! has a similar issue I fixed/worked around:
http://mid.gmane.org/20110830233232.GA19633 at dcvr.yhbt.net

--
Eric Wong
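A small standalone illustration of the hazard, assuming stock Logger
behavior (the signal choice here is arbitrary):

  require "logger"

  log = Logger.new($stdout)

  # Logger's log device synchronizes writes on an internal mutex.  If
  # this trap fires while the main thread is inside log.info (already
  # holding that mutex), the handler tries to re-lock a mutex its own
  # thread holds: Ruby raises ThreadError, and Ruby >= 2.0 forbids
  # mutex use inside trap context outright.
  trap(:USR1) { log.debug("received SIGUSR1") }

  loop do
    log.info("working")  # send SIGUSR1 repeatedly to trigger the hazard
    sleep 0.01
  end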
On Tuesday, August 30, 2011 at 5:33 PM, Eric Wong wrote:
> Eric Wong <normalperson at yhbt.net> wrote:
> > +      trap(sig) do
> > +        @logger.debug("received SIG#{sig}")
>
> I'm even more glad I didn't apply this patch to Unicorn. I completely
> forgot Logger uses a mutex internally (even though it doesn't need to
> when writing to a POSIX-compliant file system).
>
> Rainbows! has a similar issue I fixed/worked around:
> http://mid.gmane.org/20110830233232.GA19633 at dcvr.yhbt.net

So are you saying that Unicorn is affected by the same issue? I'm
confused -- does the master send USR1 signals to the workers when it
receives a USR2? I'm not sending a USR1 signal to the master.

- Alex
Alex Sharp <ajsharp at gmail.com> wrote:
> On Tuesday, August 30, 2011 at 5:33 PM, Eric Wong wrote:
> > Eric Wong <normalperson at yhbt.net> wrote:
> > > +      trap(sig) do
> > > +        @logger.debug("received SIG#{sig}")
> >
> > I'm even more glad I didn't apply this patch to Unicorn. I completely
> > forgot Logger uses a mutex internally (even though it doesn't need to
> > when writing to a POSIX-compliant file system).
> >
> > Rainbows! has a similar issue I fixed/worked around:
> > http://mid.gmane.org/20110830233232.GA19633 at dcvr.yhbt.net
>
> So are you saying that Unicorn is affected by the same issue? I'm
> confused -- does the master send USR1 signals to the workers when it
> receives a USR2? I'm not sending a USR1 signal to the master.

No, Unicorn is not affected.  It /would have/ been, had I applied that
patch.

--
Eric Wong
Also, today I did an strace on an old unicorn master that was not
responding to USR2 signals (but not pegging the cpu), and I saw tons of
gettimeofday calls in the output. Is this normal? Also, I was seeing
this after having removed newrelic from the project.

--
Alex Sharp
github.com/ajsharp
twitter.com/ajsharp
alexjsharp.com
Alex Sharp <ajsharp at gmail.com> wrote:
> Also, today I did an strace on an old unicorn master that was not
> responding to USR2 signals (but not pegging the cpu), and I saw tons
> of gettimeofday calls in the output. Is this normal? Also, I was
> seeing this after having removed newrelic from the project.

Yes, if it's from the same (timer) thread that's making a bunch of
futex() calls (which you've probably filtered out) on 1.9.2.  They'll
go away on Ruby 1.9.3.

On x86_64 and a modern-ish kernel + glibc, gettimeofday() isn't even a
syscall (and won't show up in strace):
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-gettimeofday_speedup.html