Hello,

I've been using unicorn for a while, but now I'm trying to run it for the
first time inside a FreeBSD jail. The initial start of unicorn works fine
and it serves all requests. But when I try to restart it using the USR2
signal, it (more or less slowly) starts using more and more CPU cycles.
There is no error message in the logs, and the problem is quite hard to
reproduce: in 1 of 20 tries unicorn restarts correctly, but in the other
cases I have to "kill -9" the process. I haven't found anything that gives
any indication. I've tried unicorn versions 4.1.1 and 4.2.0. The FreeBSD
version is 8.2-STABLE amd64.

This is my config:

---
listen "/home/deploy/staging/unicorn.sock"
pid "/home/deploy/staging/unicorn.pid"
preload_app true
stderr_path "/home/deploy/staging/unicorn.stderr.log"
stdout_path "/home/deploy/staging/unicorn.stdout.log"

before_fork do |server, worker|
  old_pid = "#{server.config[:pid]}.oldbin"
  if old_pid != server.pid
    begin
      process_id = File.read(old_pid).to_i
      puts "sending QUIT to #{process_id}"
      Process.kill :QUIT, process_id
    rescue Errno::ENOENT, Errno::ESRCH
    end
  end
end
---

I've tried without the before_fork block, but I don't think that's the
critical part, since the restart never reaches the point where two master
processes exist. There is just the old master consuming all the CPU
cycles.

Has anyone run into the same problem? Does anyone have an idea?

Thanks in advance
Philipp
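The restart being discussed here is nothing more than reading unicorn's pid file and sending SIGUSR2 to the master. A minimal self-contained sketch of that step (using a temporary file in place of the real pid file, and our own process in place of the unicorn master):

```ruby
require "tempfile"

# Trap USR2 so delivery is observable; a real unicorn master re-executes
# itself when it receives this signal.
got_usr2 = false
trap(:USR2) { got_usr2 = true }

# Stand-in for /home/deploy/staging/unicorn.pid from the config above.
Tempfile.create("unicorn.pid") do |f|
  f.puts $$
  f.flush
  pid = Integer(File.read(f.path).strip)
  Process.kill(:USR2, pid) # what "kill -USR2 `cat unicorn.pid`" does
end

sleep 0.1 # give the trap handler a moment to run
puts got_usr2 ? "USR2 delivered" : "USR2 lost"
```

On a healthy system this prints "USR2 delivered"; the problem described in the thread is what the master does after that point.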
Philipp Bruell <Philipp.Bruell at skrill.com> wrote:
> stderr_path "/home/deploy/staging/unicorn.stderr.log"
<snip>
> Has anyone run into the same problem? Does anyone have an idea?

Tatsuya Ono documented a workaround for jails here (see gist):
http://mid.gmane.org/CAHBuKRj09FdxAgzsefJWotexw-7JYZGJMtgUp_dhjPz9VbKD6Q at mail.gmail.com

(http://unicorn.bogomips.org/KNOWN_ISSUES.html refers to this link, too)

If that doesn't work, maybe checking stderr.log will tell you something
more.
On 1/31/12 7:39 PM, normalperson at yhbt.net wrote:
> Philipp Bruell <Philipp.Bruell at skrill.com> wrote:
>>
>> Has anyone run into the same problem? Does anyone have an idea?
>
> Tatsuya Ono documented a workaround for jails here (see gist):
> http://mid.gmane.org/CAHBuKRj09FdxAgzsefJWotexw-7JYZGJMtgUp_dhjPz9VbKD6Q at mail.gmail.com
>
> (http://unicorn.bogomips.org/KNOWN_ISSUES.html refers to this link, too)

Philipp's gone afk for the evening, so I'll take the liberty of replying
with what I know ...

We tried the fix mentioned above, and it didn't work. We also tried
switching to unix sockets; no joy. (Actually it worked once, then refused
to work again.)

> If that doesn't work, maybe checking stderr.log will tell you something
> more.

Nothing shows up in stderr.log. It's as if the master doesn't even get
the -USR2 signal. Or as if whatever it's sending to stderr is not
actually getting to the filesystem... In any case, we don't see anything.

Any further ideas for how to debug would be much appreciated, and my
apologies in advance for mixing up any details; Philipp was doing the
work on this, not me.

-c
Charles Hornberger <Charles.Hornberger at skrill.com> wrote:
> On 1/31/12 7:39 PM, normalperson at yhbt.net wrote:
> > If that doesn't work, maybe checking stderr.log will tell you
> > something more.
>
> Nothing shows up in stderr.log. It's as if the master doesn't even get
> the -USR2 signal. Or as if whatever it's sending to stderr is not
> actually getting to the filesystem...
>
> In any case, we don't see anything.

Can you check if the signal is received in the master via truss/dtruss?
Do other signals (USR1, HUP, QUIT) work?

You might need to enable the equivalent of "-f" for strace (follow child
processes/threads) since Ruby 1.9 uses a dedicated thread for receiving
signals.

> Any further ideas for how to debug would be much appreciated, and my
> apologies in advance for mixing up any details; Philipp was doing the
> work on this, not me.

Also, which version of Ruby are you using? I'm pretty familiar with the
1.9.3 implementation; the earlier 1.9.x releases were messier and noisier
wrt signal handling.
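The "do other signals work?" question can be answered independently of unicorn with a tiny diagnostic script that traps the relevant signals and reports what actually arrives. A sketch (not from the thread): here the script signals itself for demonstration; inside the jail you would run it in one shell and send the signals from another.

```ruby
# Trap the signals unicorn's master cares about and record each arrival.
received = []
sigs = %w[USR1 USR2 HUP QUIT]
sigs.each { |sig| trap(sig) { received << sig } }

# Self-test: deliver each signal to our own pid, then report what arrived.
sigs.each { |sig| Process.kill(sig, $$) }
sleep 0.1 # let the signal-handling thread run the trap handlers
puts received.sort.join(" ")
```

If delivery works, this prints all four names; any signal missing from the output never made it through.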
Philipp Bruell <Philipp.Bruell at skrill.com> wrote:
> First of all, thank you for your fast reply.

No problem, depends on the time of day of course :>

> The behaviour details Charles described are correct, and we are using
> ruby version 1.9.3.
>
> It's good that you asked about the other signals. I've checked them in
> particular and it seems to be a general signal handling problem. The
> process freaks out on each of them :-(
>
> I've attached the output of truss -f for a QUIT signal. That signal
> took quite a long time to get processed (and took all CPU cycles), but
> finally worked.

I only saw the output from the master process there, nothing from the
worker. It seems like the master is OK, but trying to kill the worker is
not. I wonder if it's related to https://bugs.ruby-lang.org/issues/5240

With the following script, can you try sending SIGQUIT to the parent and
see what happens?

------------------------------- 8< -----------------------------
pid = fork do
  r, w = IO.pipe
  trap(:QUIT) do
    puts "SIGQUIT received in child, exiting"
    w.close
  end
  r.read
end

trap(:QUIT) do
  puts "SIGQUIT received in parent, killing child"
  Process.kill(:QUIT, pid)
  p Process.waitpid2(pid)
  exit
end

sleep 1 # wait for child to set up its signal handler
puts "Child ready on #{pid}, parent on #$$"
sleep
----------------------------------------------------------------

If the above fails, try with different variables:

* without a jail on the same FreeBSD version/release/patchlevel
* Ruby 1.9.2 (which has a different signal handling implementation)
* a different FreeBSD version
* a different architecture [1]

Mixing either signal handling or fork()-ing with threads is tricky. Ruby
1.9 uses a dedicated thread internally for signal handling; I wouldn't be
surprised if there's a bug lingering somewhere in FreeBSD or Ruby...

Have you checked the FreeBSD mailing lists/bug trackers? I don't recall
seeing anything other than the aforementioned bug in ruby-core...
[1] I expect there's asm involved in the signal/threading implementation
    details, so there's a chance it's x86_64-specific...

> The output of the USR2 signal is too long to attach to a mail, but at
> first sight, it repeats the following calls over and over again.

Don't send monster attachments; host them somewhere else so mail servers
won't reject them for wasting bandwidth. The mailman limit on rubyforge
is apparently 256K (already huge IMHO). Also, don't top post.

> 24864: thr_kill(0x18c32,0x1a,0x800a8edc0,0x18a86,0x7fffffbeaf80,0x80480c000) = 0 (0x0)
> 24864: select(4,{3},0x0,0x0,{0.100000 }) = 1 (0x1)
> 24864: read(3,"!",1024) = 1 (0x1)

OK, so the signal is received correctly by the Ruby VM in the master. I
just don't see anything in the worker, but the master does attempt to
forward SIGQUIT to the worker.

> It also seems to me that observing the processes with truss changes the
> behaviour a lot. During the observed USR2, the master process spawns a
> lot (about 30) of <defunct> processes. I never had this before.

Some processes react strangely to being traced. Maybe there's something
better than truss nowadays?
Philipp Bruell <Philipp.Bruell at skrill.com> wrote:
> On 01/02/2012 19:14, "Eric Wong" <normalperson at yhbt.net> wrote:
> > Philipp Bruell <Philipp.Bruell at skrill.com> wrote:
>
> The script behaves exactly like unicorn. The master receives the QUIT
> and passes it to the child. The child also receives it, but doesn't
> exit. While the master is waiting for the child to exit, it consumes
> all the CPU cycles.

Interesting, I suspect it's some bad interaction with fork() causing
signal handlers/pthreads to go bad.

I expect the following simple script to work flawlessly since it doesn't
fork:

----------------------------------------
trap(:QUIT) { exit(0) }
puts "Ready for SIGQUIT on #$$"
sleep
----------------------------------------

> I don't have the option to test without a jail, on a different FreeBSD
> version, or on a different architecture (running FreeBSD, that is; on
> Mac OS X everything works perfectly). But I tried ruby version 1.9.2
> and that works! So I guess it's a bug with 1.9.3 on FreeBSD.

Can you report this as a bug to the Ruby core folks on
https://bugs.ruby-lang.org/ and also to wherever the FreeBSD hackers take
bug reports? Somebody from one of those camps should be able to resolve
the issue.

The good thing is my small sample script is enough to reproduce the
issue, so it should be easy for an experienced FreeBSD hacker to track
down.

> I've attached the truss -f output of the child process of the test
> script. But the observation with truss made the problem disappear again
> :-(

It could be a timing or race condition issue. I've had strace on Linux
find/hide bugs because it slowed the program down enough.
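If it is the blocking waitpid2() in the parent's QUIT handler that spins inside the jail, one way to sidestep it is to reap the child non-blockingly with WNOHANG and a short sleep. This is an untested workaround sketch, not something proposed in the thread:

```ruby
# Child stands in for a worker that takes a moment to exit.
pid = fork { sleep 0.2; exit 7 }

status = nil
loop do
  # Non-blocking wait: waitpid2 returns nil while the child still runs.
  done, status = Process.waitpid2(pid, Process::WNOHANG)
  break if done
  sleep 0.05 # poll instead of blocking, so a wedged wait cannot spin a CPU
end
puts "child exited with status #{status.exitstatus}"
```

This prints "child exited with status 7" once the child is reaped; the polling loop stays near 0% CPU regardless of how the kernel treats the blocking path.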
Eric Wong <normalperson at yhbt.net> wrote:
> Can you report this as a bug to the Ruby core folks on
> https://bugs.ruby-lang.org/ and also to wherever the FreeBSD hackers
> take bug reports? Somebody from one of those camps should be able to
> resolve the issue.

A total stab in the dark, but I posted this patch to ruby-core anyway to
find more testers/reviewers:

http://mid.gmane.org/20120202221946.GA32004 at dcvr.yhbt.net

Mind giving it a shot?
Eric Wong <normalperson at yhbt.net> wrote:
> A total stab in the dark, but I posted this patch to ruby-core anyway
> to find more testers/reviewers:
>
> http://mid.gmane.org/20120202221946.GA32004 at dcvr.yhbt.net

Oops, and I just posted a follow-up since the original was a no-op due to
ordering issues :x

http://mid.gmane.org/20120202223945.GA9233 at dcvr.yhbt.net
Eric Wong <normalperson at yhbt.net> wrote:
> > A total stab in the dark, but I posted this patch to ruby-core anyway
> > to find more testers/reviewers:

Last attempt at a patch on this issue, I'm just shotgunning here :x

http://bogomips.org/ruby.git/patch/?id=418827f4e41a618d91
ruby-core pointed me to the following issue:
https://bugs.ruby-lang.org/issues/5757

So there may already be a fix in Ruby SVN, can you test?
http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_9_3 r34425
Eric Wong <normalperson at yhbt.net> wrote:
> ruby-core pointed me to the following issue:
> https://bugs.ruby-lang.org/issues/5757
>
> So there may already be a fix in Ruby SVN, can you test?
> http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_9_3 r34425

Btw, did anybody get a chance to try this?

While working on an unrelated project, I set up FreeBSD 8.2 and 9.0 KVM
images over the weekend. Since I had the KVM images handy, I also tried
to reproduce this issue under 1.9.3-p0 (without a jail) but was unable to
reproduce it under either 8.2 or 9.0.

I tried building a jail, but didn't have enough space for a full one. If
I get the chance, I'll see how building a partial jail goes, but I'm not
optimistic about being able to reproduce this issue under KVM since it
seems to be a race condition/timing issue.

I even had both CPU cores enabled under KVM and even installed the virtio
drivers for better performance.

I assume you guys are using SMP in the jail?
Hi,

sorry for my late reply.

On 07/02/2012 07:21, "Eric Wong" <normalperson at yhbt.net> wrote:
> Eric Wong <normalperson at yhbt.net> wrote:
> > ruby-core pointed me to the following issue:
> > https://bugs.ruby-lang.org/issues/5757
> >
> > So there may already be a fix in Ruby SVN, can you test?
> > http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_9_3 r34425
>
> Btw, did anybody get a chance to try this?

I haven't tried it yet. Currently we are using RVM to install ruby, but
as soon as I have some time, I'll set up a source build of ruby and apply
some of these patches.

> While working on an unrelated project, I set up FreeBSD 8.2 and 9.0 KVM
> images over the weekend. Since I had the KVM images handy, I also tried
> to reproduce this issue under 1.9.3-p0 (without a jail) but was unable
> to reproduce it under either 8.2 or 9.0.

Yeah, I've also asked a colleague to test it under FreeBSD (without a
jail) and he can't reproduce it either. It seems to be a jail problem (I
would run in cycles too if I were in jail ;-).

> I tried building a jail, but didn't have enough space for a full one.
> If I get the chance, I'll see how building a partial jail goes, but I'm
> not optimistic about being able to reproduce this issue under KVM since
> it seems to be a race condition/timing issue.
>
> I even had both CPU cores enabled under KVM and even installed the
> virtio drivers for better performance.
>
> I assume you guys are using SMP in the jail?

Yes, I'm pretty sure that SMP is involved.

Kind regards
Philipp
On 2/7/12 9:36 AM, Philipp.Bruell at skrill.com wrote:
> Yes, I'm pretty sure that SMP is involved.

The jail is on a (big) SMP box, and we definitely have access to all the
CPUs. And since the process consumes 200% of CPU according to ps, it
seems clear that at least 2 CPUs are involved...