Hi, I''m encountering a weird error where the unicorn workers are stuck in a loop after hitting a 500 on the backend sinatra app. strace at the point where it starts to go into a loop of death select(7, [4 5], NULL, [3 6], {30, 0}) ?= 1 (in [5], left {27, 274382}) fchmod(8, 01) ? ? ? ? ? ? ? ? ? ? ? ? ? = 0 fcntl(5, F_GETFL) ? ? ? ? ? ? ? ? ? ? ? = 0x802 (flags O_RDWR|O_NONBLOCK) accept4(5, {sa_family=AF_INET, sin_port=htons(56728), sin_addr=inet_addr("10.1.1.4")}, [16], SOCK_CLOEXEC) = 12 recvfrom(12, 0x1c99fb0, 16384, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) select(13, [12], NULL, NULL, NULL) ? ? ?= ? ERESTARTNOHAND (To be restarted) --- SIGINT (Interrupt) @ 0 (0) --- rt_sigreturn(0x2) ? ? ? ? ? ? ? ? ? ? ? = -1 EINTR (Interrupted system call) sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0 sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0 sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0 sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0 sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0 Longer ?strace outputs can be found over at https://gist.github.com/fe4e3172994e5de21317 I close any open db connections in before_fork and reopen connections in after_fork. A bit of research suggests that rb_thread_wait has issues when the select receives?ERESTARTNOHAND,?any ideas as to why this might be happening ? I''m running $ uname -a Linux bbox 2.6.38-02063806-generic #201105121509 SMP Thu May 12 15:14:14 UTC 2011 x86_64 GNU/Linux $ ruby -v ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux] Thanks
Bharanee Rathna <deepfryed at gmail.com> wrote:> A bit of research suggestsi that rb_thread_wait has > issues when the select receives?ERESTARTNOHAND,?any ideas as to why > this might be happening ?Not sure, could be a bug in Ruby itself, kernel, or glibc. I''ve seen similar reports of this in the past outside of Unicorn but don''t recall ever finding a satisfactory explanation.> I''m running > > $ uname -a > Linux bbox 2.6.38-02063806-generic #201105121509 SMP Thu May 12 > 15:14:14 UTC 2011 x86_64 GNU/Linux > > $ ruby -v > ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]Can you try 1.9.2-p180 or Ruby trunk? Or maybe a different version of glibc, too. Do you have any non-standard kernel patches/scheduler configs? -- Eric Wong
> Can you try 1.9.2-p180 or Ruby trunk? ?Or maybe a different version of > glibc, too.upgraded libc6 & ruby and was able to replicate it under libc6 2.12.1-0ubuntu10.2 $ ruby -v ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]>?Do you have any non-standard kernel patches/scheduler > configs?No
Bharanee Rathna <deepfryed at gmail.com> wrote:> I''m encountering a weird error where the unicorn workers are stuck in > a loop after hitting a 500 on the backend sinatra app.Also, what extensions are you using in your app?> strace at the point where it starts to go into a loop of death> select(7, [4 5], NULL, [3 6], {30, 0}) ?= 1 (in [5], left {27, 274382}) > fchmod(8, 01) ? ? ? ? ? ? ? ? ? ? ? ? ? = 0 > fcntl(5, F_GETFL) ? ? ? ? ? ? ? ? ? ? ? = 0x802 (flags O_RDWR|O_NONBLOCK) > accept4(5, {sa_family=AF_INET, sin_port=htons(56728), > sin_addr=inet_addr("10.1.1.4")}, [16], SOCK_CLOEXEC) = 12 > recvfrom(12, 0x1c99fb0, 16384, 64, 0, 0) = -1 EAGAIN (Resource > temporarily unavailable)(I''m somewhat more awake, now, haven''t been sleeping much) Two things look off in the line above: 1) recvfrom() isn''t using the MSG_DONTWAIT flag. I know you''re using Linux, so kgio should be using MSG_DONTWAIT to do non-blocking recv... Which versions of unicorn/kgio are you using? 2) TCP_DEFER_ACCEPT should prevent recvfrom() from hitting EAGAIN in the common case under Linux.> select(13, [12], NULL, NULL, NULL) ? ? ?= ? ERESTARTNOHAND (To be restarted) > --- SIGINT (Interrupt) @ 0 (0) --- > rt_sigreturn(0x2) ? ? ? ? ? ? ? ? ? ? ? = -1 EINTR (Interrupted system call)What triggered SIGINT?> sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0 > sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0 > sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0 > sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0 > sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0 > > Longer ?strace outputs can be found over at > https://gist.github.com/fe4e3172994e5de21317Actually, after many lines of sched_yield() in your gist, I can see it does actually exit the process. Did you kill it with SIGINT? If so, I see nothing wrong... Ruby 1.9 seems to sched_yield a lot during shutdown, but it does eventually finish. -- Eric Wong
thanks for the quick response eric, On Wed, Jun 1, 2011 at 9:48 AM, Eric Wong <normalperson at yhbt.net> wrote:> Bharanee Rathna <deepfryed at gmail.com> wrote: >> I''m encountering a weird error where the unicorn workers are stuck in >> a loop after hitting a 500 on the backend sinatra app. > > Also, what extensions are you using in your app?heaps of em. yajl, swift, rmagick, fastcaptcha, flock, nokogiri & curb. except swift and curb none of the others would be touching the network.>> strace at the point where it starts to go into a loop of death > >> select(7, [4 5], NULL, [3 6], {30, 0}) ?= 1 (in [5], left {27, 274382}) >> fchmod(8, 01) ? ? ? ? ? ? ? ? ? ? ? ? ? = 0 >> fcntl(5, F_GETFL) ? ? ? ? ? ? ? ? ? ? ? = 0x802 (flags O_RDWR|O_NONBLOCK) >> accept4(5, {sa_family=AF_INET, sin_port=htons(56728), >> sin_addr=inet_addr("10.1.1.4")}, [16], SOCK_CLOEXEC) = 12 >> recvfrom(12, 0x1c99fb0, 16384, 64, 0, 0) = -1 EAGAIN (Resource >> temporarily unavailable) > > (I''m somewhat more awake, now, haven''t been sleeping much) > > Two things look off in the line above: > > 1) recvfrom() isn''t using the MSG_DONTWAIT flag. ?I know you''re using > ? Linux, so kgio should be using MSG_DONTWAIT to do non-blocking > ? recv... ?Which versions of unicorn/kgio are you using?using kgio 2.3.2, i''ll upgrade it and give it another try> > 2) TCP_DEFER_ACCEPT should prevent recvfrom() from hitting EAGAIN > ? in the common case under Linux. > >> select(13, [12], NULL, NULL, NULL) ? ? ?= ? ERESTARTNOHAND (To be restarted) >> --- SIGINT (Interrupt) @ 0 (0) --- >> rt_sigreturn(0x2) ? ? ? ? ? ? ? ? ? ? ? = -1 EINTR (Interrupted system call) > > What triggered SIGINT?not sure> > Actually, after many lines of sched_yield() in your gist, I can see it > does actually exit the process. ?Did you kill it with SIGINT? ?If so, I > see nothing wrong...yes i killed it after the worker looked stuck and wasn''t responding for 30s
>> 1) recvfrom() isn''t using the MSG_DONTWAIT flag. ?I know you''re using >> ? Linux, so kgio should be using MSG_DONTWAIT to do non-blocking >> ? recv... ?Which versions of unicorn/kgio are you using? > > using kgio 2.3.2, i''ll upgrade it and give it another tryrepeatable with kgio 2.4.1
Bharanee Rathna <deepfryed at gmail.com> wrote:> On Wed, Jun 1, 2011 at 9:48 AM, Eric Wong <normalperson at yhbt.net> wrote: > > Bharanee Rathna <deepfryed at gmail.com> wrote: > >> I''m encountering a weird error where the unicorn workers are stuck in > >> a loop after hitting a 500 on the backend sinatra app. > > > > Also, what extensions are you using in your app? > > heaps of em. yajl, swift, rmagick, fastcaptcha, flock, nokogiri & > curb. except swift and curb none of the others would be touching the > network.So are any of them hitting Unicorn from a Unicorn worker itself? This only happens when your app hits a 500? I would not write an app that does that, but if you do, be sure to shutdown any open connections on a 500 (or avoid the error in the first place) ...> >> strace at the point where it starts to go into a loop of deathActually, the sched_yield() only started when you hit sent SIGINT.> > What triggered SIGINT? > > not sure > > > > Actually, after many lines of sched_yield() in your gist, I can see it > > does actually exit the process. ?Did you kill it with SIGINT? ?If so, I > > see nothing wrong... > > yes i killed it after the worker looked stuck and wasn''t responding for 30sSo you hit Ctrl-C (which sends SIGINT)? So basically somebody from 10.1.1.4 opened a connection on Unicorn and just let it sit idle there. That somebody at 10.1.1.4 could''ve been the worker that triggered the 500 (and forgot it had open ccnnections) or a third party that made a request but sit idle because it couldn''t handle the 500 which your Unicorn worker sent.>From what I understand so far, it''s not a Unicorn problem, butsomething in your app. Also, you wouldn''t have this problem if nginx is in front of Unicorn since nginx won''t open a connection and sit idle before making a request. -- Eric Wong