thr3ads.net - mongrel unicorn - unicorn stuck in sched_yield after ERESTARTNOHAND [May 2011]

If this information is useful, please help other people find it:
Share via:

Bharanee Rathna

2011-May-31 12:02 UTC

unicorn stuck in sched_yield after ERESTARTNOHAND

Hi,

I''m encountering a weird error where the unicorn workers are stuck in
a loop after hitting a 500 on the backend sinatra app.

strace at the point where it starts to go into a loop of death


select(7, [4 5], NULL, [3 6], {30, 0}) ?= 1 (in [5], left {27, 274382})
fchmod(8, 01) ? ? ? ? ? ? ? ? ? ? ? ? ? = 0
fcntl(5, F_GETFL) ? ? ? ? ? ? ? ? ? ? ? = 0x802 (flags O_RDWR|O_NONBLOCK)
accept4(5, {sa_family=AF_INET, sin_port=htons(56728),
sin_addr=inet_addr("10.1.1.4")}, [16], SOCK_CLOEXEC) = 12
recvfrom(12, 0x1c99fb0, 16384, 64, 0, 0) = -1 EAGAIN (Resource
temporarily unavailable)
select(13, [12], NULL, NULL, NULL) ? ? ?= ? ERESTARTNOHAND (To be restarted)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2) ? ? ? ? ? ? ? ? ? ? ? = -1 EINTR (Interrupted system call)
sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0
sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0
sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0
sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0
sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0

Longer ?strace outputs can be found over at
https://gist.github.com/fe4e3172994e5de21317
I close any open db connections in before_fork and reopen connections
in after_fork. A bit of research suggests that rb_thread_wait has
issues when the select receives?ERESTARTNOHAND,?any ideas as to why
this might be happening ?

I''m running

$ uname -a
Linux bbox 2.6.38-02063806-generic #201105121509 SMP Thu May 12
15:14:14 UTC 2011 x86_64 GNU/Linux

$ ruby -v
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]

Thanks

Eric Wong

2011-May-31 15:17 UTC

head link

unicorn stuck in sched_yield after ERESTARTNOHAND

Bharanee Rathna <deepfryed at gmail.com> wrote:> A bit of research suggestsi that rb_thread_wait has
> issues when the select receives?ERESTARTNOHAND,?any ideas as to why
> this might be happening ?
Not sure, could be a bug in Ruby itself, kernel, or glibc.

I''ve seen similar reports of this in the past outside of Unicorn but
don''t recall ever finding a satisfactory explanation.
> I''m running
> 
> $ uname -a
> Linux bbox 2.6.38-02063806-generic #201105121509 SMP Thu May 12
> 15:14:14 UTC 2011 x86_64 GNU/Linux
> 
> $ ruby -v
> ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
Can you try 1.9.2-p180 or Ruby trunk?  Or maybe a different version of
glibc, too.  Do you have any non-standard kernel patches/scheduler
configs?

-- 
Eric Wong

Bharanee Rathna

2011-May-31 22:28 UTC

head link

unicorn stuck in sched_yield after ERESTARTNOHAND

> Can you try 1.9.2-p180 or Ruby trunk? ?Or maybe a different version of
> glibc, too.
upgraded libc6 & ruby and was able to replicate it under libc6
2.12.1-0ubuntu10.2

$ ruby -v
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]

>?Do you have any non-standard kernel patches/scheduler
> configs?
No

Eric Wong

2011-May-31 23:48 UTC

head link

unicorn stuck in sched_yield after ERESTARTNOHAND

Bharanee Rathna <deepfryed at gmail.com> wrote:> I''m encountering a weird error where the unicorn workers are stuck
in
> a loop after hitting a 500 on the backend sinatra app.
Also, what extensions are you using in your app?
> strace at the point where it starts to go into a loop of death
> select(7, [4 5], NULL, [3 6], {30, 0}) ?= 1 (in [5], left {27, 274382})
> fchmod(8, 01) ? ? ? ? ? ? ? ? ? ? ? ? ? = 0
> fcntl(5, F_GETFL) ? ? ? ? ? ? ? ? ? ? ? = 0x802 (flags O_RDWR|O_NONBLOCK)
> accept4(5, {sa_family=AF_INET, sin_port=htons(56728),
> sin_addr=inet_addr("10.1.1.4")}, [16], SOCK_CLOEXEC) = 12
> recvfrom(12, 0x1c99fb0, 16384, 64, 0, 0) = -1 EAGAIN (Resource
> temporarily unavailable)
(I''m somewhat more awake, now, haven''t been sleeping much)

Two things look off in the line above:

1) recvfrom() isn''t using the MSG_DONTWAIT flag.  I know
you''re using
   Linux, so kgio should be using MSG_DONTWAIT to do non-blocking
   recv...  Which versions of unicorn/kgio are you using?

2) TCP_DEFER_ACCEPT should prevent recvfrom() from hitting EAGAIN
   in the common case under Linux.
> select(13, [12], NULL, NULL, NULL) ? ? ?= ? ERESTARTNOHAND (To be
restarted)
> --- SIGINT (Interrupt) @ 0 (0) ---
> rt_sigreturn(0x2) ? ? ? ? ? ? ? ? ? ? ? = -1 EINTR (Interrupted system
call)
What triggered SIGINT?
> sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0
> sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0
> sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0
> sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0
> sched_yield() ? ? ? ? ? ? ? ? ? ? ? ? ? = 0
> 
> Longer ?strace outputs can be found over at
> https://gist.github.com/fe4e3172994e5de21317
Actually, after many lines of sched_yield() in your gist, I can see it
does actually exit the process.  Did you kill it with SIGINT?  If so, I
see nothing wrong...

Ruby 1.9 seems to sched_yield a lot during shutdown, but it does
eventually finish.

-- 
Eric Wong

Bharanee Rathna

2011-Jun-01 00:31 UTC

head link

unicorn stuck in sched_yield after ERESTARTNOHAND

thanks for the quick response eric,

On Wed, Jun 1, 2011 at 9:48 AM, Eric Wong <normalperson at yhbt.net>
wrote:> Bharanee Rathna <deepfryed at gmail.com> wrote:
>> I''m encountering a weird error where the unicorn workers are
stuck in
>> a loop after hitting a 500 on the backend sinatra app.
>
> Also, what extensions are you using in your app?
heaps of em. yajl, swift, rmagick, fastcaptcha, flock, nokogiri &
curb.  except swift and curb none of the others would be touching the
network.
>> strace at the point where it starts to go into a loop of death
>
>> select(7, [4 5], NULL, [3 6], {30, 0}) ?= 1 (in [5], left {27, 274382})
>> fchmod(8, 01) ? ? ? ? ? ? ? ? ? ? ? ? ? = 0
>> fcntl(5, F_GETFL) ? ? ? ? ? ? ? ? ? ? ? = 0x802 (flags
O_RDWR|O_NONBLOCK)
>> accept4(5, {sa_family=AF_INET, sin_port=htons(56728),
>> sin_addr=inet_addr("10.1.1.4")}, [16], SOCK_CLOEXEC) = 12
>> recvfrom(12, 0x1c99fb0, 16384, 64, 0, 0) = -1 EAGAIN (Resource
>> temporarily unavailable)
>
> (I''m somewhat more awake, now, haven''t been sleeping
much)
>
> Two things look off in the line above:
>
> 1) recvfrom() isn''t using the MSG_DONTWAIT flag. ?I know
you''re using
> ? Linux, so kgio should be using MSG_DONTWAIT to do non-blocking
> ? recv... ?Which versions of unicorn/kgio are you using?
using kgio 2.3.2, i''ll upgrade it and give it another try
>
> 2) TCP_DEFER_ACCEPT should prevent recvfrom() from hitting EAGAIN
> ? in the common case under Linux.
>
>> select(13, [12], NULL, NULL, NULL) ? ? ?= ? ERESTARTNOHAND (To be
restarted)
>> --- SIGINT (Interrupt) @ 0 (0) ---
>> rt_sigreturn(0x2) ? ? ? ? ? ? ? ? ? ? ? = -1 EINTR (Interrupted system
call)
>
> What triggered SIGINT?
not sure
>
> Actually, after many lines of sched_yield() in your gist, I can see it
> does actually exit the process. ?Did you kill it with SIGINT? ?If so, I
> see nothing wrong...
yes i killed it after the worker looked stuck and wasn''t responding for
30s

Bharanee Rathna

2011-Jun-01 00:44 UTC

head link

unicorn stuck in sched_yield after ERESTARTNOHAND

>> 1) recvfrom() isn''t using the MSG_DONTWAIT flag. ?I know
you''re using
>> ? Linux, so kgio should be using MSG_DONTWAIT to do non-blocking
>> ? recv... ?Which versions of unicorn/kgio are you using?
>
> using kgio 2.3.2, i''ll upgrade it and give it another try
repeatable with kgio 2.4.1

Eric Wong

2011-Jun-01 16:48 UTC

head link

unicorn stuck in sched_yield after ERESTARTNOHAND

Bharanee Rathna <deepfryed at gmail.com> wrote:> On Wed, Jun 1, 2011 at 9:48 AM, Eric Wong <normalperson at yhbt.net>
wrote:
> > Bharanee Rathna <deepfryed at gmail.com> wrote:
> >> I''m encountering a weird error where the unicorn workers
are stuck in
> >> a loop after hitting a 500 on the backend sinatra app.
> >
> > Also, what extensions are you using in your app?
> 
> heaps of em. yajl, swift, rmagick, fastcaptcha, flock, nokogiri &
> curb.  except swift and curb none of the others would be touching the
> network.
So are any of them hitting Unicorn from a Unicorn worker itself?
This only happens when your app hits a 500?  I would not write
an app that does that, but if you do, be sure to shutdown any
open connections on a 500 (or avoid the error in the first place) ...
> >> strace at the point where it starts to go into a loop of death
Actually, the sched_yield() only started when you hit sent SIGINT.
> > What triggered SIGINT?
> 
> not sure
> >
> > Actually, after many lines of sched_yield() in your gist, I can see it
> > does actually exit the process. ?Did you kill it with SIGINT? ?If so,
I
> > see nothing wrong...
> 
> yes i killed it after the worker looked stuck and wasn''t
responding for 30s
So you hit Ctrl-C (which sends SIGINT)?

So basically somebody from 10.1.1.4 opened a connection on Unicorn
and just let it sit idle there.  That somebody at 10.1.1.4 could''ve
been the worker that triggered the 500 (and forgot it had open
ccnnections)  or a third party that made a request but sit idle
because it couldn''t handle the 500 which your Unicorn worker sent.
>From what I understand so far, it''s not a Unicorn problem, butsomething in your app.

Also, you wouldn''t have this problem if nginx is in front of Unicorn
since nginx won''t open a connection and sit idle before making a
request.

-- 
Eric Wong

mongrel unicorn - May 2011 - unicorn stuck in sched_yield after ERESTARTNOHAND

unicorn stuck in sched_yield after ERESTARTNOHAND

unicorn stuck in sched_yield after ERESTARTNOHAND

unicorn stuck in sched_yield after ERESTARTNOHAND

unicorn stuck in sched_yield after ERESTARTNOHAND

unicorn stuck in sched_yield after ERESTARTNOHAND

unicorn stuck in sched_yield after ERESTARTNOHAND

unicorn stuck in sched_yield after ERESTARTNOHAND