Amanda Waite
2009-Jan-20 03:22 UTC
[Eventmachine-talk] thin occasionally aborts in eventmachine on
I''ve been testing with Thin 1.0.0 on OpenSolaris 2008.11. Every now and again and particularly under load, a thin will abort in eventmachine and will drop core. The error is caused by an assertion that in a call to ed->Write() there should be some data to write and there isn''t. The stack trace is at the end of this message. I''ve only been able to recreate this with a single thin, with about 35-40 concurrent users. The App is a Rails v2.1.0 application (http://incubator.apache.org/olio/). The error is: Assertion failed: nbytes > 0, file ed.cpp, line 622 Abort (core dumped) At the point that the failure happens, on entering ConnectionDescriptor::_WriteOutboundData(), OutboundPages.size() = 0, although in the only test where it aborted while I was monitoring it, OutboundDataSize was 16k The gem is built with g++ and I''ve tried turning down the optimisation from -O3 to none but it made no difference. Any ideas or suggestion on how to dig deeper into the problem? Thanks Amanda, Stack Trace: root at dn21:/export/faban/olio_rails/olioapp# pstack core core ''core'' of 15660: /usr/ruby/1.8/bin/ruby /var/ruby/1.8/gem_home/bin/thin -e production s fecb17c5 _lwp_kill (1, 6, 8034820, fec5a02e) + 15 fec5a03a raise (6, 0, 8034870, fec315ea) + 22 fec3160a abort (65737341, 6f697472, 6166206e, 64656c69, 626e203a, 73657479) + f2 fec3185a _assert (fea9225d, fea92220, 26e, fea830d6) + 82 fea832bf _ZN20ConnectionDescriptor18_WriteOutboundDataEv (a1abc40, 8038b18, 8038bdc, fea82fd1) + 1f9 fea830b8 _ZN20ConnectionDescriptor5WriteEv (a1abc40, 28, 8038b60, fea89597) + f4 fea89869 _ZN14EventMachine_t14_RunSelectOnceEv (9aa48c8, 85106, fecb0615, fea89254) + 2e1 fea892a4 _ZN14EventMachine_t8_RunOnceEv (9aa48c8, 8039254, 0, fea891d8) + 5c fea8922f _ZN14EventMachine_t3RunEv (9aa48c8, 808a254, 2, fea72745) + 63 fea7283e evma_run_machine (9639760, 808a254, 80391fc, 80681fc, 83b15c0, 808a254) + 106 fea7796c _Z29t_run_machine_without_threadsm (83b15c0) + 18 080681fc rb_call0 (83b1480, 83b15c0, 5161, 5161, 0, 0) + 998 08068dd2 rb_call (83b1480, 83b15c0, 5161, 0, 0, 2) + 196 080627f9 rb_eval (83b15c0, 83c71b8) + 19ed 08063241 rb_eval (83b15c0, 83c7654) + 2435 080686ca rb_call0 (83b1480, 83b15c0, 1411, 1411, 0, 0) + e66 08068dd2 rb_call (83b1480, 83b15c0, 1411, 0, 0, 0) + 196 08062c99 rb_eval (842937c, 8426f78) + 1e8d 0806ea95 block_pass (842937c, 8426fa0) + 3a1 080636c8 rb_eval (842937c, 8427158) + 28bc 080686ca rb_call0 (84249e4, 842937c, 13c1, 13c1, 0, 0) + e66 08068dd2 rb_call (84249e4, 842937c, 13c1, 0, 0, 0) + 196 08062c99 rb_eval (84297f0, 82c1c28) + 1e8d 080686ca rb_call0 (82b9ba4, 84297f0, 13c1, 13c1, 0, 0) + e66 08068dd2 rb_call (82b9ba4, 84297f0, 13c1, 0, 0, 0) + 196 08062c99 rb_eval (8433098, 842fde4) + 1e8d 080686ca rb_call0 (8429840, 8433098, 13c1, 13c1, 0, 8041464) + e66 08068dd2 rb_call (8429840, 8433098, 13c1, 0, 8041464, 1) + 196 08068fbc rb_f_send (1, 8041460, 8433098) + f4 0806820d rb_call0 (80c8bb0, 8433098, fd1, fd1, 1, 8041460) + 9a9 08068dd2 rb_call (80c8bb0, 8433098, fd1, 1, 8041460, 0) + 196 08062c99 rb_eval (82ce4b4, 8360abc) + 1e8d 0806344a rb_eval (82ce4b4, 83612dc) + 263e 080686ca rb_call0 (82cda78, 82ce4b4, 6351, 6351, 0, 0) + e66 08068dd2 rb_call (82cda78, 82ce4b4, 6351, 0, 0, 2) + 196 080627f9 rb_eval (82ce4b4, 8361534) + 19ed 080686ca rb_call0 (82cda78, 82ce4b4, 4ecf, 4ecf, 0, 0) + e66 08068dd2 rb_call (82cda78, 82ce4b4, 4ecf, 0, 0, 0) + 196 08062c99 rb_eval (80c78b4, 83ec9a4) + 1e8d 0805dfeb eval_node (80c78b4, 83ec9a4) + 3f 0806a819 rb_load (83ecbd4, 0) + 391 0806aaf4 rb_f_load (1, 80469e0, 80c78b4) + 48 0806820d rb_call0 (80c8bb0, 80c78b4, 25c1, 25c1, 1, 80469e0) + 9a9 08068dd2 rb_call (80c8bb0, 80c78b4, 25c1, 1, 80469e0, 1) + 196 08062a39 rb_eval (80c78b4, 80b761c) + 1c2d 0805dfeb eval_node (80c78b4, 80b761c) + 3f 0805e6d6 ruby_exec_internal (808a254, 80c8c50, 0, 0, 0, a00000) + d6 0805e74b ruby_exec (808a254, 8058cbe, 8047d4c, 8058cc6, feffb7e4, feffde70) + 27 0805e773 ruby_run (feffb7e4, feffde70, 0, 8047d4c, 8076d8d, feffb7e4) + 23 08058cc6 main (5, 8047d74, 8047d8c) + 3a 08058bfe _start (5, 8047e30, 8047e47, 8047e67, 8047e6a, 8047e75) + 7a
Roger Pack
2009-Jan-20 06:22 UTC
[Eventmachine-talk] thin occasionally aborts in eventmachine on
I used to run into that all the time on OS X under load, I believe. I never did get to the bottom of it, though you might be able to get away with taking out that assertion line. -=r On Tue, Jan 20, 2009 at 8:22 AM, Amanda Waite <Amanda.Waite at sun.com> wrote:> I''ve been testing with Thin 1.0.0 on OpenSolaris 2008.11. Every now and > again and particularly under load, a thin will abort in eventmachine and > will drop core. The error is caused by an assertion that in a call to > ed->Write() there should be some data to write and there isn''t. The stack > trace is at the end of this message. I''ve only been able to recreate this > with a single thin, with about 35-40 concurrent users. The App is a Rails > v2.1.0 application (http://incubator.apache.org/olio/). > > The error is: > > Assertion failed: nbytes > 0, file ed.cpp, line 622 > Abort (core dumped) > > At the point that the failure happens, on entering > ConnectionDescriptor::_WriteOutboundData(), OutboundPages.size() = 0, > although in the only test where it aborted while I was monitoring it, > OutboundDataSize was 16k > > The gem is built with g++ and I''ve tried turning down the optimisation from > -O3 to none but it made no difference. > > Any ideas or suggestion on how to dig deeper into the problem? > > Thanks > > Amanda, > > Stack Trace: > > > root at dn21:/export/faban/olio_rails/olioapp# pstack core > core ''core'' of 15660: /usr/ruby/1.8/bin/ruby > /var/ruby/1.8/gem_home/bin/thin -e production s > fecb17c5 _lwp_kill (1, 6, 8034820, fec5a02e) + 15 > fec5a03a raise (6, 0, 8034870, fec315ea) + 22 > fec3160a abort (65737341, 6f697472, 6166206e, 64656c69, 626e203a, > 73657479) + f2 > fec3185a _assert (fea9225d, fea92220, 26e, fea830d6) + 82 > fea832bf _ZN20ConnectionDescriptor18_WriteOutboundDataEv (a1abc40, 8038b18, > 8038bdc, fea82fd1) + 1f9 > fea830b8 _ZN20ConnectionDescriptor5WriteEv (a1abc40, 28, 8038b60, fea89597) > + f4 > fea89869 _ZN14EventMachine_t14_RunSelectOnceEv (9aa48c8, 85106, fecb0615, > fea89254) + 2e1 > fea892a4 _ZN14EventMachine_t8_RunOnceEv (9aa48c8, 8039254, 0, fea891d8) + 5c > fea8922f _ZN14EventMachine_t3RunEv (9aa48c8, 808a254, 2, fea72745) + 63 > fea7283e evma_run_machine (9639760, 808a254, 80391fc, 80681fc, 83b15c0, > 808a254) + 106 > fea7796c _Z29t_run_machine_without_threadsm (83b15c0) + 18 > 080681fc rb_call0 (83b1480, 83b15c0, 5161, 5161, 0, 0) + 998 > 08068dd2 rb_call (83b1480, 83b15c0, 5161, 0, 0, 2) + 196 > 080627f9 rb_eval (83b15c0, 83c71b8) + 19ed > 08063241 rb_eval (83b15c0, 83c7654) + 2435 > 080686ca rb_call0 (83b1480, 83b15c0, 1411, 1411, 0, 0) + e66 > 08068dd2 rb_call (83b1480, 83b15c0, 1411, 0, 0, 0) + 196 > 08062c99 rb_eval (842937c, 8426f78) + 1e8d > 0806ea95 block_pass (842937c, 8426fa0) + 3a1 > 080636c8 rb_eval (842937c, 8427158) + 28bc > 080686ca rb_call0 (84249e4, 842937c, 13c1, 13c1, 0, 0) + e66 > 08068dd2 rb_call (84249e4, 842937c, 13c1, 0, 0, 0) + 196 > 08062c99 rb_eval (84297f0, 82c1c28) + 1e8d > 080686ca rb_call0 (82b9ba4, 84297f0, 13c1, 13c1, 0, 0) + e66 > 08068dd2 rb_call (82b9ba4, 84297f0, 13c1, 0, 0, 0) + 196 > 08062c99 rb_eval (8433098, 842fde4) + 1e8d > 080686ca rb_call0 (8429840, 8433098, 13c1, 13c1, 0, 8041464) + e66 > 08068dd2 rb_call (8429840, 8433098, 13c1, 0, 8041464, 1) + 196 > 08068fbc rb_f_send (1, 8041460, 8433098) + f4 > 0806820d rb_call0 (80c8bb0, 8433098, fd1, fd1, 1, 8041460) + 9a9 > 08068dd2 rb_call (80c8bb0, 8433098, fd1, 1, 8041460, 0) + 196 > 08062c99 rb_eval (82ce4b4, 8360abc) + 1e8d > 0806344a rb_eval (82ce4b4, 83612dc) + 263e > 080686ca rb_call0 (82cda78, 82ce4b4, 6351, 6351, 0, 0) + e66 > 08068dd2 rb_call (82cda78, 82ce4b4, 6351, 0, 0, 2) + 196 > 080627f9 rb_eval (82ce4b4, 8361534) + 19ed > 080686ca rb_call0 (82cda78, 82ce4b4, 4ecf, 4ecf, 0, 0) + e66 > 08068dd2 rb_call (82cda78, 82ce4b4, 4ecf, 0, 0, 0) + 196 > 08062c99 rb_eval (80c78b4, 83ec9a4) + 1e8d > 0805dfeb eval_node (80c78b4, 83ec9a4) + 3f > 0806a819 rb_load (83ecbd4, 0) + 391 > 0806aaf4 rb_f_load (1, 80469e0, 80c78b4) + 48 > 0806820d rb_call0 (80c8bb0, 80c78b4, 25c1, 25c1, 1, 80469e0) + 9a9 > 08068dd2 rb_call (80c8bb0, 80c78b4, 25c1, 1, 80469e0, 1) + 196 > 08062a39 rb_eval (80c78b4, 80b761c) + 1c2d > 0805dfeb eval_node (80c78b4, 80b761c) + 3f > 0805e6d6 ruby_exec_internal (808a254, 80c8c50, 0, 0, 0, a00000) + d6 > 0805e74b ruby_exec (808a254, 8058cbe, 8047d4c, 8058cc6, feffb7e4, feffde70) > + 27 > 0805e773 ruby_run (feffb7e4, feffde70, 0, 8047d4c, 8076d8d, feffb7e4) + 23 > 08058cc6 main (5, 8047d74, 8047d8c) + 3a > 08058bfe _start (5, 8047e30, 8047e47, 8047e67, 8047e6a, 8047e75) + 7a > > _______________________________________________ > Eventmachine-talk mailing list > Eventmachine-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/eventmachine-talk >-- 2 Timothy 1:7
Aman Gupta
2009-Jan-20 07:24 UTC
[Eventmachine-talk] thin occasionally aborts in eventmachine on
Hmm, usually this pops up on OSX when using kqueue, but it looks like you''re using the select() reactor. I''m not quite sure why this is happening.. the fd should only be passed to select() if bConnectPending, or OutboundPages.size()>0. Since OutboundPages.size()==0, bConnectPending must be set, but then Write() wouldn''t have called _WriteOutboundData(). Can you check what the value of bConnectPending is when the error occurs? You could also just put in a check at the beginning of _WriteOutboundData(): if (OutboundPages.size() == 0) return; Aman On Tue, Jan 20, 2009 at 7:22 AM, Amanda Waite <Amanda.Waite at sun.com> wrote:> I''ve been testing with Thin 1.0.0 on OpenSolaris 2008.11. Every now and > again and particularly under load, a thin will abort in eventmachine and > will drop core. The error is caused by an assertion that in a call to > ed->Write() there should be some data to write and there isn''t. The stack > trace is at the end of this message. I''ve only been able to recreate this > with a single thin, with about 35-40 concurrent users. The App is a Rails > v2.1.0 application (http://incubator.apache.org/olio/). > > The error is: > > Assertion failed: nbytes > 0, file ed.cpp, line 622 > Abort (core dumped) > > At the point that the failure happens, on entering > ConnectionDescriptor::_WriteOutboundData(), OutboundPages.size() = 0, > although in the only test where it aborted while I was monitoring it, > OutboundDataSize was 16k > > The gem is built with g++ and I''ve tried turning down the optimisation from > -O3 to none but it made no difference. > > Any ideas or suggestion on how to dig deeper into the problem? > > Thanks > > Amanda, > > Stack Trace: > > > root at dn21:/export/faban/olio_rails/olioapp# pstack core > core ''core'' of 15660: /usr/ruby/1.8/bin/ruby > /var/ruby/1.8/gem_home/bin/thin -e production s > fecb17c5 _lwp_kill (1, 6, 8034820, fec5a02e) + 15 > fec5a03a raise (6, 0, 8034870, fec315ea) + 22 > fec3160a abort (65737341, 6f697472, 6166206e, 64656c69, 626e203a, > 73657479) + f2 > fec3185a _assert (fea9225d, fea92220, 26e, fea830d6) + 82 > fea832bf _ZN20ConnectionDescriptor18_WriteOutboundDataEv (a1abc40, 8038b18, > 8038bdc, fea82fd1) + 1f9 > fea830b8 _ZN20ConnectionDescriptor5WriteEv (a1abc40, 28, 8038b60, fea89597) > + f4 > fea89869 _ZN14EventMachine_t14_RunSelectOnceEv (9aa48c8, 85106, fecb0615, > fea89254) + 2e1 > fea892a4 _ZN14EventMachine_t8_RunOnceEv (9aa48c8, 8039254, 0, fea891d8) + 5c > fea8922f _ZN14EventMachine_t3RunEv (9aa48c8, 808a254, 2, fea72745) + 63 > fea7283e evma_run_machine (9639760, 808a254, 80391fc, 80681fc, 83b15c0, > 808a254) + 106 > fea7796c _Z29t_run_machine_without_threadsm (83b15c0) + 18 > 080681fc rb_call0 (83b1480, 83b15c0, 5161, 5161, 0, 0) + 998 > 08068dd2 rb_call (83b1480, 83b15c0, 5161, 0, 0, 2) + 196 > 080627f9 rb_eval (83b15c0, 83c71b8) + 19ed > 08063241 rb_eval (83b15c0, 83c7654) + 2435 > 080686ca rb_call0 (83b1480, 83b15c0, 1411, 1411, 0, 0) + e66 > 08068dd2 rb_call (83b1480, 83b15c0, 1411, 0, 0, 0) + 196 > 08062c99 rb_eval (842937c, 8426f78) + 1e8d > 0806ea95 block_pass (842937c, 8426fa0) + 3a1 > 080636c8 rb_eval (842937c, 8427158) + 28bc > 080686ca rb_call0 (84249e4, 842937c, 13c1, 13c1, 0, 0) + e66 > 08068dd2 rb_call (84249e4, 842937c, 13c1, 0, 0, 0) + 196 > 08062c99 rb_eval (84297f0, 82c1c28) + 1e8d > 080686ca rb_call0 (82b9ba4, 84297f0, 13c1, 13c1, 0, 0) + e66 > 08068dd2 rb_call (82b9ba4, 84297f0, 13c1, 0, 0, 0) + 196 > 08062c99 rb_eval (8433098, 842fde4) + 1e8d > 080686ca rb_call0 (8429840, 8433098, 13c1, 13c1, 0, 8041464) + e66 > 08068dd2 rb_call (8429840, 8433098, 13c1, 0, 8041464, 1) + 196 > 08068fbc rb_f_send (1, 8041460, 8433098) + f4 > 0806820d rb_call0 (80c8bb0, 8433098, fd1, fd1, 1, 8041460) + 9a9 > 08068dd2 rb_call (80c8bb0, 8433098, fd1, 1, 8041460, 0) + 196 > 08062c99 rb_eval (82ce4b4, 8360abc) + 1e8d > 0806344a rb_eval (82ce4b4, 83612dc) + 263e > 080686ca rb_call0 (82cda78, 82ce4b4, 6351, 6351, 0, 0) + e66 > 08068dd2 rb_call (82cda78, 82ce4b4, 6351, 0, 0, 2) + 196 > 080627f9 rb_eval (82ce4b4, 8361534) + 19ed > 080686ca rb_call0 (82cda78, 82ce4b4, 4ecf, 4ecf, 0, 0) + e66 > 08068dd2 rb_call (82cda78, 82ce4b4, 4ecf, 0, 0, 0) + 196 > 08062c99 rb_eval (80c78b4, 83ec9a4) + 1e8d > 0805dfeb eval_node (80c78b4, 83ec9a4) + 3f > 0806a819 rb_load (83ecbd4, 0) + 391 > 0806aaf4 rb_f_load (1, 80469e0, 80c78b4) + 48 > 0806820d rb_call0 (80c8bb0, 80c78b4, 25c1, 25c1, 1, 80469e0) + 9a9 > 08068dd2 rb_call (80c8bb0, 80c78b4, 25c1, 1, 80469e0, 1) + 196 > 08062a39 rb_eval (80c78b4, 80b761c) + 1c2d > 0805dfeb eval_node (80c78b4, 80b761c) + 3f > 0805e6d6 ruby_exec_internal (808a254, 80c8c50, 0, 0, 0, a00000) + d6 > 0805e74b ruby_exec (808a254, 8058cbe, 8047d4c, 8058cc6, feffb7e4, feffde70) > + 27 > 0805e773 ruby_run (feffb7e4, feffde70, 0, 8047d4c, 8076d8d, feffb7e4) + 23 > 08058cc6 main (5, 8047d74, 8047d8c) + 3a > 08058bfe _start (5, 8047e30, 8047e47, 8047e67, 8047e6a, 8047e75) + 7a > > _______________________________________________ > Eventmachine-talk mailing list > Eventmachine-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/eventmachine-talk >
Roger Pack
2009-Jan-20 12:13 UTC
[Eventmachine-talk] thin occasionally aborts in eventmachine on
Perhaps it is selecting for writable, then the socket is closed, then it tries to write out? On Tue, Jan 20, 2009 at 12:24 PM, Aman Gupta <themastermind1 at gmail.com> wrote:> Hmm, usually this pops up on OSX when using kqueue, but it looks like > you''re using the select() reactor. > > I''m not quite sure why this is happening.. the fd should only be > passed to select() if bConnectPending, or OutboundPages.size()>0. > Since OutboundPages.size()==0, bConnectPending must be set, but then > Write() wouldn''t have called _WriteOutboundData(). > > Can you check what the value of bConnectPending is when the error > occurs? You could also just put in a check at the beginning of > _WriteOutboundData(): > > if (OutboundPages.size() == 0) return; > > Aman > > On Tue, Jan 20, 2009 at 7:22 AM, Amanda Waite <Amanda.Waite at sun.com> wrote: >> I''ve been testing with Thin 1.0.0 on OpenSolaris 2008.11. Every now and >> again and particularly under load, a thin will abort in eventmachine and >> will drop core. The error is caused by an assertion that in a call to >> ed->Write() there should be some data to write and there isn''t. The stack >> trace is at the end of this message. I''ve only been able to recreate this >> with a single thin, with about 35-40 concurrent users. The App is a Rails >> v2.1.0 application (http://incubator.apache.org/olio/). >> >> The error is: >> >> Assertion failed: nbytes > 0, file ed.cpp, line 622 >> Abort (core dumped) >> >> At the point that the failure happens, on entering >> ConnectionDescriptor::_WriteOutboundData(), OutboundPages.size() = 0, >> although in the only test where it aborted while I was monitoring it, >> OutboundDataSize was 16k >> >> The gem is built with g++ and I''ve tried turning down the optimisation from >> -O3 to none but it made no difference. >> >> Any ideas or suggestion on how to dig deeper into the problem? >> >> Thanks >> >> Amanda, >> >> Stack Trace: >> >> >> root at dn21:/export/faban/olio_rails/olioapp# pstack core >> core ''core'' of 15660: /usr/ruby/1.8/bin/ruby >> /var/ruby/1.8/gem_home/bin/thin -e production s >> fecb17c5 _lwp_kill (1, 6, 8034820, fec5a02e) + 15 >> fec5a03a raise (6, 0, 8034870, fec315ea) + 22 >> fec3160a abort (65737341, 6f697472, 6166206e, 64656c69, 626e203a, >> 73657479) + f2 >> fec3185a _assert (fea9225d, fea92220, 26e, fea830d6) + 82 >> fea832bf _ZN20ConnectionDescriptor18_WriteOutboundDataEv (a1abc40, 8038b18, >> 8038bdc, fea82fd1) + 1f9 >> fea830b8 _ZN20ConnectionDescriptor5WriteEv (a1abc40, 28, 8038b60, fea89597) >> + f4 >> fea89869 _ZN14EventMachine_t14_RunSelectOnceEv (9aa48c8, 85106, fecb0615, >> fea89254) + 2e1 >> fea892a4 _ZN14EventMachine_t8_RunOnceEv (9aa48c8, 8039254, 0, fea891d8) + 5c >> fea8922f _ZN14EventMachine_t3RunEv (9aa48c8, 808a254, 2, fea72745) + 63 >> fea7283e evma_run_machine (9639760, 808a254, 80391fc, 80681fc, 83b15c0, >> 808a254) + 106 >> fea7796c _Z29t_run_machine_without_threadsm (83b15c0) + 18 >> 080681fc rb_call0 (83b1480, 83b15c0, 5161, 5161, 0, 0) + 998 >> 08068dd2 rb_call (83b1480, 83b15c0, 5161, 0, 0, 2) + 196 >> 080627f9 rb_eval (83b15c0, 83c71b8) + 19ed >> 08063241 rb_eval (83b15c0, 83c7654) + 2435 >> 080686ca rb_call0 (83b1480, 83b15c0, 1411, 1411, 0, 0) + e66 >> 08068dd2 rb_call (83b1480, 83b15c0, 1411, 0, 0, 0) + 196 >> 08062c99 rb_eval (842937c, 8426f78) + 1e8d >> 0806ea95 block_pass (842937c, 8426fa0) + 3a1 >> 080636c8 rb_eval (842937c, 8427158) + 28bc >> 080686ca rb_call0 (84249e4, 842937c, 13c1, 13c1, 0, 0) + e66 >> 08068dd2 rb_call (84249e4, 842937c, 13c1, 0, 0, 0) + 196 >> 08062c99 rb_eval (84297f0, 82c1c28) + 1e8d >> 080686ca rb_call0 (82b9ba4, 84297f0, 13c1, 13c1, 0, 0) + e66 >> 08068dd2 rb_call (82b9ba4, 84297f0, 13c1, 0, 0, 0) + 196 >> 08062c99 rb_eval (8433098, 842fde4) + 1e8d >> 080686ca rb_call0 (8429840, 8433098, 13c1, 13c1, 0, 8041464) + e66 >> 08068dd2 rb_call (8429840, 8433098, 13c1, 0, 8041464, 1) + 196 >> 08068fbc rb_f_send (1, 8041460, 8433098) + f4 >> 0806820d rb_call0 (80c8bb0, 8433098, fd1, fd1, 1, 8041460) + 9a9 >> 08068dd2 rb_call (80c8bb0, 8433098, fd1, 1, 8041460, 0) + 196 >> 08062c99 rb_eval (82ce4b4, 8360abc) + 1e8d >> 0806344a rb_eval (82ce4b4, 83612dc) + 263e >> 080686ca rb_call0 (82cda78, 82ce4b4, 6351, 6351, 0, 0) + e66 >> 08068dd2 rb_call (82cda78, 82ce4b4, 6351, 0, 0, 2) + 196 >> 080627f9 rb_eval (82ce4b4, 8361534) + 19ed >> 080686ca rb_call0 (82cda78, 82ce4b4, 4ecf, 4ecf, 0, 0) + e66 >> 08068dd2 rb_call (82cda78, 82ce4b4, 4ecf, 0, 0, 0) + 196 >> 08062c99 rb_eval (80c78b4, 83ec9a4) + 1e8d >> 0805dfeb eval_node (80c78b4, 83ec9a4) + 3f >> 0806a819 rb_load (83ecbd4, 0) + 391 >> 0806aaf4 rb_f_load (1, 80469e0, 80c78b4) + 48 >> 0806820d rb_call0 (80c8bb0, 80c78b4, 25c1, 25c1, 1, 80469e0) + 9a9 >> 08068dd2 rb_call (80c8bb0, 80c78b4, 25c1, 1, 80469e0, 1) + 196 >> 08062a39 rb_eval (80c78b4, 80b761c) + 1c2d >> 0805dfeb eval_node (80c78b4, 80b761c) + 3f >> 0805e6d6 ruby_exec_internal (808a254, 80c8c50, 0, 0, 0, a00000) + d6 >> 0805e74b ruby_exec (808a254, 8058cbe, 8047d4c, 8058cc6, feffb7e4, feffde70) >> + 27 >> 0805e773 ruby_run (feffb7e4, feffde70, 0, 8047d4c, 8076d8d, feffb7e4) + 23 >> 08058cc6 main (5, 8047d74, 8047d8c) + 3a >> 08058bfe _start (5, 8047e30, 8047e47, 8047e67, 8047e6a, 8047e75) + 7a >> >> _______________________________________________ >> Eventmachine-talk mailing list >> Eventmachine-talk at rubyforge.org >> http://rubyforge.org/mailman/listinfo/eventmachine-talk >> > _______________________________________________ > Eventmachine-talk mailing list > Eventmachine-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/eventmachine-talk >-- 2 Timothy 1:7
Amanda waite
2009-Jan-20 13:41 UTC
[Eventmachine-talk] thin occasionally aborts in eventmachine
Aman Gupta wrote:> Hmm, usually this pops up on OSX when using kqueue, but it looks like > you''re using the select() reactor. > > I''m not quite sure why this is happening.. the fd should only be > passed to select() if bConnectPending, or OutboundPages.size()>0. > Since OutboundPages.size()==0, bConnectPending must be set, but then > Write() wouldn''t have called _WriteOutboundData(). > > Can you check what the value of bConnectPending is when the error > occurs?bConnectPending is false when the error occurs, both from running dbx on the core file and from some logging I added. bNotifyWritable is also false, if that helps. ed->Write was called from: if (FD_ISSET (sd, &(SelectData.fdwrites))) ed->Write(); Which is probably obvious, but I can''t see in this code path where a check is made for OutboundPages.size()>0 other than just before the assert that causes the abort.> You could also just put in a check at the beginning of > _WriteOutboundData(): > > if (OutboundPages.size() == 0) return; >I''ll try this and will add some logging to show when it happens as it may not be obvious. Thanks to everyone for their advice so far. Amanda> Aman > > On Tue, Jan 20, 2009 at 7:22 AM, Amanda Waite <Amanda.Waite at sun.com> wrote: > >> I''ve been testing with Thin 1.0.0 on OpenSolaris 2008.11. Every now and >> again and particularly under load, a thin will abort in eventmachine and >> will drop core. The error is caused by an assertion that in a call to >> ed->Write() there should be some data to write and there isn''t. The stack >> trace is at the end of this message. I''ve only been able to recreate this >> with a single thin, with about 35-40 concurrent users. The App is a Rails >> v2.1.0 application (http://incubator.apache.org/olio/). >> >> The error is: >> >> Assertion failed: nbytes > 0, file ed.cpp, line 622 >> Abort (core dumped) >> >> At the point that the failure happens, on entering >> ConnectionDescriptor::_WriteOutboundData(), OutboundPages.size() = 0, >> although in the only test where it aborted while I was monitoring it, >> OutboundDataSize was 16k >> >> The gem is built with g++ and I''ve tried turning down the optimisation from >> -O3 to none but it made no difference. >> >> Any ideas or suggestion on how to dig deeper into the problem? >> >> Thanks >> >> Amanda, >> >> Stack Trace: >> >> >> root at dn21:/export/faban/olio_rails/olioapp# pstack core >> core ''core'' of 15660: /usr/ruby/1.8/bin/ruby >> /var/ruby/1.8/gem_home/bin/thin -e production s >> fecb17c5 _lwp_kill (1, 6, 8034820, fec5a02e) + 15 >> fec5a03a raise (6, 0, 8034870, fec315ea) + 22 >> fec3160a abort (65737341, 6f697472, 6166206e, 64656c69, 626e203a, >> 73657479) + f2 >> fec3185a _assert (fea9225d, fea92220, 26e, fea830d6) + 82 >> fea832bf _ZN20ConnectionDescriptor18_WriteOutboundDataEv (a1abc40, 8038b18, >> 8038bdc, fea82fd1) + 1f9 >> fea830b8 _ZN20ConnectionDescriptor5WriteEv (a1abc40, 28, 8038b60, fea89597) >> + f4 >> fea89869 _ZN14EventMachine_t14_RunSelectOnceEv (9aa48c8, 85106, fecb0615, >> fea89254) + 2e1 >> fea892a4 _ZN14EventMachine_t8_RunOnceEv (9aa48c8, 8039254, 0, fea891d8) + 5c >> fea8922f _ZN14EventMachine_t3RunEv (9aa48c8, 808a254, 2, fea72745) + 63 >> fea7283e evma_run_machine (9639760, 808a254, 80391fc, 80681fc, 83b15c0, >> 808a254) + 106 >> fea7796c _Z29t_run_machine_without_threadsm (83b15c0) + 18 >> 080681fc rb_call0 (83b1480, 83b15c0, 5161, 5161, 0, 0) + 998 >> 08068dd2 rb_call (83b1480, 83b15c0, 5161, 0, 0, 2) + 196 >> 080627f9 rb_eval (83b15c0, 83c71b8) + 19ed >> 08063241 rb_eval (83b15c0, 83c7654) + 2435 >> 080686ca rb_call0 (83b1480, 83b15c0, 1411, 1411, 0, 0) + e66 >> 08068dd2 rb_call (83b1480, 83b15c0, 1411, 0, 0, 0) + 196 >> 08062c99 rb_eval (842937c, 8426f78) + 1e8d >> 0806ea95 block_pass (842937c, 8426fa0) + 3a1 >> 080636c8 rb_eval (842937c, 8427158) + 28bc >> 080686ca rb_call0 (84249e4, 842937c, 13c1, 13c1, 0, 0) + e66 >> 08068dd2 rb_call (84249e4, 842937c, 13c1, 0, 0, 0) + 196 >> 08062c99 rb_eval (84297f0, 82c1c28) + 1e8d >> 080686ca rb_call0 (82b9ba4, 84297f0, 13c1, 13c1, 0, 0) + e66 >> 08068dd2 rb_call (82b9ba4, 84297f0, 13c1, 0, 0, 0) + 196 >> 08062c99 rb_eval (8433098, 842fde4) + 1e8d >> 080686ca rb_call0 (8429840, 8433098, 13c1, 13c1, 0, 8041464) + e66 >> 08068dd2 rb_call (8429840, 8433098, 13c1, 0, 8041464, 1) + 196 >> 08068fbc rb_f_send (1, 8041460, 8433098) + f4 >> 0806820d rb_call0 (80c8bb0, 8433098, fd1, fd1, 1, 8041460) + 9a9 >> 08068dd2 rb_call (80c8bb0, 8433098, fd1, 1, 8041460, 0) + 196 >> 08062c99 rb_eval (82ce4b4, 8360abc) + 1e8d >> 0806344a rb_eval (82ce4b4, 83612dc) + 263e >> 080686ca rb_call0 (82cda78, 82ce4b4, 6351, 6351, 0, 0) + e66 >> 08068dd2 rb_call (82cda78, 82ce4b4, 6351, 0, 0, 2) + 196 >> 080627f9 rb_eval (82ce4b4, 8361534) + 19ed >> 080686ca rb_call0 (82cda78, 82ce4b4, 4ecf, 4ecf, 0, 0) + e66 >> 08068dd2 rb_call (82cda78, 82ce4b4, 4ecf, 0, 0, 0) + 196 >> 08062c99 rb_eval (80c78b4, 83ec9a4) + 1e8d >> 0805dfeb eval_node (80c78b4, 83ec9a4) + 3f >> 0806a819 rb_load (83ecbd4, 0) + 391 >> 0806aaf4 rb_f_load (1, 80469e0, 80c78b4) + 48 >> 0806820d rb_call0 (80c8bb0, 80c78b4, 25c1, 25c1, 1, 80469e0) + 9a9 >> 08068dd2 rb_call (80c8bb0, 80c78b4, 25c1, 1, 80469e0, 1) + 196 >> 08062a39 rb_eval (80c78b4, 80b761c) + 1c2d >> 0805dfeb eval_node (80c78b4, 80b761c) + 3f >> 0805e6d6 ruby_exec_internal (808a254, 80c8c50, 0, 0, 0, a00000) + d6 >> 0805e74b ruby_exec (808a254, 8058cbe, 8047d4c, 8058cc6, feffb7e4, feffde70) >> + 27 >> 0805e773 ruby_run (feffb7e4, feffde70, 0, 8047d4c, 8076d8d, feffb7e4) + 23 >> 08058cc6 main (5, 8047d74, 8047d8c) + 3a >> 08058bfe _start (5, 8047e30, 8047e47, 8047e67, 8047e6a, 8047e75) + 7a >> >> _______________________________________________ >> Eventmachine-talk mailing list >> Eventmachine-talk at rubyforge.org >> http://rubyforge.org/mailman/listinfo/eventmachine-talk >> >> > _______________________________________________ > Eventmachine-talk mailing list > Eventmachine-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/eventmachine-talk >
Mark V
2009-Jan-20 13:43 UTC
[Eventmachine-talk] thin occasionally aborts in eventmachine on
On Wed, Jan 21, 2009 at 2:22 AM, Amanda Waite <Amanda.Waite at sun.com> wrote:> I''ve been testing with Thin 1.0.0 on OpenSolaris 2008.11. Every now and > again and particularly under load, a thin will abort in eventmachine and > will drop core. The error is caused by an assertion that in a call to > ed->Write() there should be some data to write and there isn''t. The stack > trace is at the end of this message. I''ve only been able to recreate this > with a single thin, with about 35-40 concurrent users. The App is a Rails > v2.1.0 application (http://incubator.apache.org/olio/). > > The error is: > > Assertion failed: nbytes > 0, file ed.cpp, line 622 > Abort (core dumped) > > At the point that the failure happens, on entering > ConnectionDescriptor::_WriteOutboundData(), OutboundPages.size() = 0, > although in the only test where it aborted while I was monitoring it, > OutboundDataSize was 16k > > The gem is built with g++ and I''ve tried turning down the optimisation from > -O3 to none but it made no difference.Hmm, I ran into a only when under load problem, on Linux.... but the experience might give you some ideas? In my case a file copy from the local HDD to a PVFS2 area (network oriented), using ''dd'' completed OK, but failed verification only when /both/ the local HDD and network connection were under _very_ heavy load (both saturated). You might be able to simulate that circumstance?> Any ideas or suggestion on how to dig deeper into the problem?The symptom I saw was ''lost ticks'' being reported in /var/log/messages when the log level was turned up. Does OpenSolaris use tickless clocks? One suggestion I encountered was to turn off any CPU scaling - not sure if that applies to your OS? That might give you somethings to poke at? Mark> > Thanks > > Amanda, > > Stack Trace: > > > root at dn21:/export/faban/olio_rails/olioapp# pstack core > core ''core'' of 15660: /usr/ruby/1.8/bin/ruby > /var/ruby/1.8/gem_home/bin/thin -e production s > fecb17c5 _lwp_kill (1, 6, 8034820, fec5a02e) + 15 > fec5a03a raise (6, 0, 8034870, fec315ea) + 22 > fec3160a abort (65737341, 6f697472, 6166206e, 64656c69, 626e203a, > 73657479) + f2 > fec3185a _assert (fea9225d, fea92220, 26e, fea830d6) + 82 > fea832bf _ZN20ConnectionDescriptor18_WriteOutboundDataEv (a1abc40, 8038b18, > 8038bdc, fea82fd1) + 1f9 > fea830b8 _ZN20ConnectionDescriptor5WriteEv (a1abc40, 28, 8038b60, fea89597) > + f4 > fea89869 _ZN14EventMachine_t14_RunSelectOnceEv (9aa48c8, 85106, fecb0615, > fea89254) + 2e1 > fea892a4 _ZN14EventMachine_t8_RunOnceEv (9aa48c8, 8039254, 0, fea891d8) + 5c > fea8922f _ZN14EventMachine_t3RunEv (9aa48c8, 808a254, 2, fea72745) + 63 > fea7283e evma_run_machine (9639760, 808a254, 80391fc, 80681fc, 83b15c0, > 808a254) + 106 > fea7796c _Z29t_run_machine_without_threadsm (83b15c0) + 18 > 080681fc rb_call0 (83b1480, 83b15c0, 5161, 5161, 0, 0) + 998 > 08068dd2 rb_call (83b1480, 83b15c0, 5161, 0, 0, 2) + 196 > 080627f9 rb_eval (83b15c0, 83c71b8) + 19ed > 08063241 rb_eval (83b15c0, 83c7654) + 2435 > 080686ca rb_call0 (83b1480, 83b15c0, 1411, 1411, 0, 0) + e66 > 08068dd2 rb_call (83b1480, 83b15c0, 1411, 0, 0, 0) + 196 > 08062c99 rb_eval (842937c, 8426f78) + 1e8d > 0806ea95 block_pass (842937c, 8426fa0) + 3a1 > 080636c8 rb_eval (842937c, 8427158) + 28bc > 080686ca rb_call0 (84249e4, 842937c, 13c1, 13c1, 0, 0) + e66 > 08068dd2 rb_call (84249e4, 842937c, 13c1, 0, 0, 0) + 196 > 08062c99 rb_eval (84297f0, 82c1c28) + 1e8d > 080686ca rb_call0 (82b9ba4, 84297f0, 13c1, 13c1, 0, 0) + e66 > 08068dd2 rb_call (82b9ba4, 84297f0, 13c1, 0, 0, 0) + 196 > 08062c99 rb_eval (8433098, 842fde4) + 1e8d > 080686ca rb_call0 (8429840, 8433098, 13c1, 13c1, 0, 8041464) + e66 > 08068dd2 rb_call (8429840, 8433098, 13c1, 0, 8041464, 1) + 196 > 08068fbc rb_f_send (1, 8041460, 8433098) + f4 > 0806820d rb_call0 (80c8bb0, 8433098, fd1, fd1, 1, 8041460) + 9a9 > 08068dd2 rb_call (80c8bb0, 8433098, fd1, 1, 8041460, 0) + 196 > 08062c99 rb_eval (82ce4b4, 8360abc) + 1e8d > 0806344a rb_eval (82ce4b4, 83612dc) + 263e > 080686ca rb_call0 (82cda78, 82ce4b4, 6351, 6351, 0, 0) + e66 > 08068dd2 rb_call (82cda78, 82ce4b4, 6351, 0, 0, 2) + 196 > 080627f9 rb_eval (82ce4b4, 8361534) + 19ed > 080686ca rb_call0 (82cda78, 82ce4b4, 4ecf, 4ecf, 0, 0) + e66 > 08068dd2 rb_call (82cda78, 82ce4b4, 4ecf, 0, 0, 0) + 196 > 08062c99 rb_eval (80c78b4, 83ec9a4) + 1e8d > 0805dfeb eval_node (80c78b4, 83ec9a4) + 3f > 0806a819 rb_load (83ecbd4, 0) + 391 > 0806aaf4 rb_f_load (1, 80469e0, 80c78b4) + 48 > 0806820d rb_call0 (80c8bb0, 80c78b4, 25c1, 25c1, 1, 80469e0) + 9a9 > 08068dd2 rb_call (80c8bb0, 80c78b4, 25c1, 1, 80469e0, 1) + 196 > 08062a39 rb_eval (80c78b4, 80b761c) + 1c2d > 0805dfeb eval_node (80c78b4, 80b761c) + 3f > 0805e6d6 ruby_exec_internal (808a254, 80c8c50, 0, 0, 0, a00000) + d6 > 0805e74b ruby_exec (808a254, 8058cbe, 8047d4c, 8058cc6, feffb7e4, feffde70) > + 27 > 0805e773 ruby_run (feffb7e4, feffde70, 0, 8047d4c, 8076d8d, feffb7e4) + 23 > 08058cc6 main (5, 8047d74, 8047d8c) + 3a > 08058bfe _start (5, 8047e30, 8047e47, 8047e67, 8047e6a, 8047e75) + 7a > > _______________________________________________ > Eventmachine-talk mailing list > Eventmachine-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/eventmachine-talk >
Aman Gupta
2009-Jan-20 14:11 UTC
[Eventmachine-talk] thin occasionally aborts in eventmachine on
>> Can you check what the value of bConnectPending is when the error >> occurs? > > bConnectPending is false when the error occurs, both from running dbx on the > core file and from > some logging I added. bNotifyWritable is also false, if that helps. > > ed->Write was called from: > > if (FD_ISSET (sd, &(SelectData.fdwrites))) > ed->Write(); > > Which is probably obvious, but I can''t see in this code path where a check > is made for > OutboundPages.size()>0 other than just before the assert that causes the > abort.Before calling select, it adds the fd to the fdset if SelectForWrite() is true (em.cpp:748): if (ed->SelectForWrite()) FD_SET (sd, &(SelectData.fdwrites)); and in SelectForWrite, it checks for bConnectPending and Outbound pages (ed.cpp:373): if (bConnectPending || bNotifyWritable) return true; else { return (GetOutboundDataSize() > 0); } I''m not quite sure why this function uses GetOutboundDataSize() instead of OutboundPages.size(). Interestingly, DatagramDescriptor::SelectForWrite() uses OutboundPages.size() instead: //return (GetOutboundDataSize() > 0); (Original) return (OutboundPages.size() > 0); So presumably a similar issue was found in the UDP code and fixed. GetOutboundDataSize() returns OutboundDataSize, so it sounds like OutboundDataSize and OutboundPages.size() are getting out of sync somewhere. Can you try changing SelectForWrite to return (OutboundPages.size() > 0)? Aman> > >> You could also just put in a check at the beginning of >> _WriteOutboundData(): >> >> if (OutboundPages.size() == 0) return; >> > > I''ll try this and will add some logging to show when it happens as it may > not be obvious. > > Thanks to everyone for their advice so far. > > Amanda >> >> Aman >> >> On Tue, Jan 20, 2009 at 7:22 AM, Amanda Waite <Amanda.Waite at sun.com> >> wrote: >> >>> >>> I''ve been testing with Thin 1.0.0 on OpenSolaris 2008.11. Every now and >>> again and particularly under load, a thin will abort in eventmachine and >>> will drop core. The error is caused by an assertion that in a call to >>> ed->Write() there should be some data to write and there isn''t. The stack >>> trace is at the end of this message. I''ve only been able to recreate this >>> with a single thin, with about 35-40 concurrent users. The App is a Rails >>> v2.1.0 application (http://incubator.apache.org/olio/). >>> >>> The error is: >>> >>> Assertion failed: nbytes > 0, file ed.cpp, line 622 >>> Abort (core dumped) >>> >>> At the point that the failure happens, on entering >>> ConnectionDescriptor::_WriteOutboundData(), OutboundPages.size() = 0, >>> although in the only test where it aborted while I was monitoring it, >>> OutboundDataSize was 16k >>> >>> The gem is built with g++ and I''ve tried turning down the optimisation >>> from >>> -O3 to none but it made no difference. >>> >>> Any ideas or suggestion on how to dig deeper into the problem? >>> >>> Thanks >>> >>> Amanda, >>> >>> Stack Trace: >>> >>> >>> root at dn21:/export/faban/olio_rails/olioapp# pstack core >>> core ''core'' of 15660: /usr/ruby/1.8/bin/ruby >>> /var/ruby/1.8/gem_home/bin/thin -e production s >>> fecb17c5 _lwp_kill (1, 6, 8034820, fec5a02e) + 15 >>> fec5a03a raise (6, 0, 8034870, fec315ea) + 22 >>> fec3160a abort (65737341, 6f697472, 6166206e, 64656c69, 626e203a, >>> 73657479) + f2 >>> fec3185a _assert (fea9225d, fea92220, 26e, fea830d6) + 82 >>> fea832bf _ZN20ConnectionDescriptor18_WriteOutboundDataEv (a1abc40, >>> 8038b18, >>> 8038bdc, fea82fd1) + 1f9 >>> fea830b8 _ZN20ConnectionDescriptor5WriteEv (a1abc40, 28, 8038b60, >>> fea89597) >>> + f4 >>> fea89869 _ZN14EventMachine_t14_RunSelectOnceEv (9aa48c8, 85106, fecb0615, >>> fea89254) + 2e1 >>> fea892a4 _ZN14EventMachine_t8_RunOnceEv (9aa48c8, 8039254, 0, fea891d8) + >>> 5c >>> fea8922f _ZN14EventMachine_t3RunEv (9aa48c8, 808a254, 2, fea72745) + 63 >>> fea7283e evma_run_machine (9639760, 808a254, 80391fc, 80681fc, 83b15c0, >>> 808a254) + 106 >>> fea7796c _Z29t_run_machine_without_threadsm (83b15c0) + 18 >>> 080681fc rb_call0 (83b1480, 83b15c0, 5161, 5161, 0, 0) + 998 >>> 08068dd2 rb_call (83b1480, 83b15c0, 5161, 0, 0, 2) + 196 >>> 080627f9 rb_eval (83b15c0, 83c71b8) + 19ed >>> 08063241 rb_eval (83b15c0, 83c7654) + 2435 >>> 080686ca rb_call0 (83b1480, 83b15c0, 1411, 1411, 0, 0) + e66 >>> 08068dd2 rb_call (83b1480, 83b15c0, 1411, 0, 0, 0) + 196 >>> 08062c99 rb_eval (842937c, 8426f78) + 1e8d >>> 0806ea95 block_pass (842937c, 8426fa0) + 3a1 >>> 080636c8 rb_eval (842937c, 8427158) + 28bc >>> 080686ca rb_call0 (84249e4, 842937c, 13c1, 13c1, 0, 0) + e66 >>> 08068dd2 rb_call (84249e4, 842937c, 13c1, 0, 0, 0) + 196 >>> 08062c99 rb_eval (84297f0, 82c1c28) + 1e8d >>> 080686ca rb_call0 (82b9ba4, 84297f0, 13c1, 13c1, 0, 0) + e66 >>> 08068dd2 rb_call (82b9ba4, 84297f0, 13c1, 0, 0, 0) + 196 >>> 08062c99 rb_eval (8433098, 842fde4) + 1e8d >>> 080686ca rb_call0 (8429840, 8433098, 13c1, 13c1, 0, 8041464) + e66 >>> 08068dd2 rb_call (8429840, 8433098, 13c1, 0, 8041464, 1) + 196 >>> 08068fbc rb_f_send (1, 8041460, 8433098) + f4 >>> 0806820d rb_call0 (80c8bb0, 8433098, fd1, fd1, 1, 8041460) + 9a9 >>> 08068dd2 rb_call (80c8bb0, 8433098, fd1, 1, 8041460, 0) + 196 >>> 08062c99 rb_eval (82ce4b4, 8360abc) + 1e8d >>> 0806344a rb_eval (82ce4b4, 83612dc) + 263e >>> 080686ca rb_call0 (82cda78, 82ce4b4, 6351, 6351, 0, 0) + e66 >>> 08068dd2 rb_call (82cda78, 82ce4b4, 6351, 0, 0, 2) + 196 >>> 080627f9 rb_eval (82ce4b4, 8361534) + 19ed >>> 080686ca rb_call0 (82cda78, 82ce4b4, 4ecf, 4ecf, 0, 0) + e66 >>> 08068dd2 rb_call (82cda78, 82ce4b4, 4ecf, 0, 0, 0) + 196 >>> 08062c99 rb_eval (80c78b4, 83ec9a4) + 1e8d >>> 0805dfeb eval_node (80c78b4, 83ec9a4) + 3f >>> 0806a819 rb_load (83ecbd4, 0) + 391 >>> 0806aaf4 rb_f_load (1, 80469e0, 80c78b4) + 48 >>> 0806820d rb_call0 (80c8bb0, 80c78b4, 25c1, 25c1, 1, 80469e0) + 9a9 >>> 08068dd2 rb_call (80c8bb0, 80c78b4, 25c1, 1, 80469e0, 1) + 196 >>> 08062a39 rb_eval (80c78b4, 80b761c) + 1c2d >>> 0805dfeb eval_node (80c78b4, 80b761c) + 3f >>> 0805e6d6 ruby_exec_internal (808a254, 80c8c50, 0, 0, 0, a00000) + d6 >>> 0805e74b ruby_exec (808a254, 8058cbe, 8047d4c, 8058cc6, feffb7e4, >>> feffde70) >>> + 27 >>> 0805e773 ruby_run (feffb7e4, feffde70, 0, 8047d4c, 8076d8d, feffb7e4) + >>> 23 >>> 08058cc6 main (5, 8047d74, 8047d8c) + 3a >>> 08058bfe _start (5, 8047e30, 8047e47, 8047e67, 8047e6a, 8047e75) + 7a >>> >>> _______________________________________________ >>> Eventmachine-talk mailing list >>> Eventmachine-talk at rubyforge.org >>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>> >>> >> >> _______________________________________________ >> Eventmachine-talk mailing list >> Eventmachine-talk at rubyforge.org >> http://rubyforge.org/mailman/listinfo/eventmachine-talk >> > > _______________________________________________ > Eventmachine-talk mailing list > Eventmachine-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/eventmachine-talk >
Aman Gupta
2009-Jan-21 07:51 UTC
[Eventmachine-talk] thin occasionally aborts in eventmachine on
I think I might have found the issue. Can you confirm if this fixes the failure: diff --git a/ext/ed.cpp b/ext/ed.cpp index 27e2d11..1f82443 100644 --- a/ext/ed.cpp +++ b/ext/ed.cpp @@ -624,7 +624,7 @@ void ConnectionDescriptor::_WriteOutboundData() assert (GetSocket() != INVALID_SOCKET); int bytes_written = send (GetSocket(), output_buffer, nbytes, 0); - if (bytes_written > 0) { + if (bytes_written >= 0) { OutboundDataSize -= bytes_written; if ((size_t)bytes_written < nbytes) { int len = nbytes - bytes_written; Aman On Tue, Jan 20, 2009 at 6:11 PM, Aman Gupta <themastermind1 at gmail.com> wrote:>>> Can you check what the value of bConnectPending is when the error >>> occurs? >> >> bConnectPending is false when the error occurs, both from running dbx on the >> core file and from >> some logging I added. bNotifyWritable is also false, if that helps. >> >> ed->Write was called from: >> >> if (FD_ISSET (sd, &(SelectData.fdwrites))) >> ed->Write(); >> >> Which is probably obvious, but I can''t see in this code path where a check >> is made for >> OutboundPages.size()>0 other than just before the assert that causes the >> abort. > > Before calling select, it adds the fd to the fdset if SelectForWrite() > is true (em.cpp:748): > > if (ed->SelectForWrite()) > FD_SET (sd, &(SelectData.fdwrites)); > > and in SelectForWrite, it checks for bConnectPending and Outbound > pages (ed.cpp:373): > > if (bConnectPending || bNotifyWritable) > return true; > else { > return (GetOutboundDataSize() > 0); > } > > I''m not quite sure why this function uses GetOutboundDataSize() > instead of OutboundPages.size(). Interestingly, > DatagramDescriptor::SelectForWrite() uses OutboundPages.size() > instead: > > //return (GetOutboundDataSize() > 0); (Original) > return (OutboundPages.size() > 0); > > So presumably a similar issue was found in the UDP code and fixed. > > GetOutboundDataSize() returns OutboundDataSize, so it sounds like > OutboundDataSize and OutboundPages.size() are getting out of sync > somewhere. Can you try changing SelectForWrite to return > (OutboundPages.size() > 0)? > > Aman > >> >> >>> You could also just put in a check at the beginning of >>> _WriteOutboundData(): >>> >>> if (OutboundPages.size() == 0) return; >>> >> >> I''ll try this and will add some logging to show when it happens as it may >> not be obvious. >> >> Thanks to everyone for their advice so far. >> >> Amanda >>> >>> Aman >>> >>> On Tue, Jan 20, 2009 at 7:22 AM, Amanda Waite <Amanda.Waite at sun.com> >>> wrote: >>> >>>> >>>> I''ve been testing with Thin 1.0.0 on OpenSolaris 2008.11. Every now and >>>> again and particularly under load, a thin will abort in eventmachine and >>>> will drop core. The error is caused by an assertion that in a call to >>>> ed->Write() there should be some data to write and there isn''t. The stack >>>> trace is at the end of this message. I''ve only been able to recreate this >>>> with a single thin, with about 35-40 concurrent users. The App is a Rails >>>> v2.1.0 application (http://incubator.apache.org/olio/). >>>> >>>> The error is: >>>> >>>> Assertion failed: nbytes > 0, file ed.cpp, line 622 >>>> Abort (core dumped) >>>> >>>> At the point that the failure happens, on entering >>>> ConnectionDescriptor::_WriteOutboundData(), OutboundPages.size() = 0, >>>> although in the only test where it aborted while I was monitoring it, >>>> OutboundDataSize was 16k >>>> >>>> The gem is built with g++ and I''ve tried turning down the optimisation >>>> from >>>> -O3 to none but it made no difference. >>>> >>>> Any ideas or suggestion on how to dig deeper into the problem? >>>> >>>> Thanks >>>> >>>> Amanda, >>>> >>>> Stack Trace: >>>> >>>> >>>> root at dn21:/export/faban/olio_rails/olioapp# pstack core >>>> core ''core'' of 15660: /usr/ruby/1.8/bin/ruby >>>> /var/ruby/1.8/gem_home/bin/thin -e production s >>>> fecb17c5 _lwp_kill (1, 6, 8034820, fec5a02e) + 15 >>>> fec5a03a raise (6, 0, 8034870, fec315ea) + 22 >>>> fec3160a abort (65737341, 6f697472, 6166206e, 64656c69, 626e203a, >>>> 73657479) + f2 >>>> fec3185a _assert (fea9225d, fea92220, 26e, fea830d6) + 82 >>>> fea832bf _ZN20ConnectionDescriptor18_WriteOutboundDataEv (a1abc40, >>>> 8038b18, >>>> 8038bdc, fea82fd1) + 1f9 >>>> fea830b8 _ZN20ConnectionDescriptor5WriteEv (a1abc40, 28, 8038b60, >>>> fea89597) >>>> + f4 >>>> fea89869 _ZN14EventMachine_t14_RunSelectOnceEv (9aa48c8, 85106, fecb0615, >>>> fea89254) + 2e1 >>>> fea892a4 _ZN14EventMachine_t8_RunOnceEv (9aa48c8, 8039254, 0, fea891d8) + >>>> 5c >>>> fea8922f _ZN14EventMachine_t3RunEv (9aa48c8, 808a254, 2, fea72745) + 63 >>>> fea7283e evma_run_machine (9639760, 808a254, 80391fc, 80681fc, 83b15c0, >>>> 808a254) + 106 >>>> fea7796c _Z29t_run_machine_without_threadsm (83b15c0) + 18 >>>> 080681fc rb_call0 (83b1480, 83b15c0, 5161, 5161, 0, 0) + 998 >>>> 08068dd2 rb_call (83b1480, 83b15c0, 5161, 0, 0, 2) + 196 >>>> 080627f9 rb_eval (83b15c0, 83c71b8) + 19ed >>>> 08063241 rb_eval (83b15c0, 83c7654) + 2435 >>>> 080686ca rb_call0 (83b1480, 83b15c0, 1411, 1411, 0, 0) + e66 >>>> 08068dd2 rb_call (83b1480, 83b15c0, 1411, 0, 0, 0) + 196 >>>> 08062c99 rb_eval (842937c, 8426f78) + 1e8d >>>> 0806ea95 block_pass (842937c, 8426fa0) + 3a1 >>>> 080636c8 rb_eval (842937c, 8427158) + 28bc >>>> 080686ca rb_call0 (84249e4, 842937c, 13c1, 13c1, 0, 0) + e66 >>>> 08068dd2 rb_call (84249e4, 842937c, 13c1, 0, 0, 0) + 196 >>>> 08062c99 rb_eval (84297f0, 82c1c28) + 1e8d >>>> 080686ca rb_call0 (82b9ba4, 84297f0, 13c1, 13c1, 0, 0) + e66 >>>> 08068dd2 rb_call (82b9ba4, 84297f0, 13c1, 0, 0, 0) + 196 >>>> 08062c99 rb_eval (8433098, 842fde4) + 1e8d >>>> 080686ca rb_call0 (8429840, 8433098, 13c1, 13c1, 0, 8041464) + e66 >>>> 08068dd2 rb_call (8429840, 8433098, 13c1, 0, 8041464, 1) + 196 >>>> 08068fbc rb_f_send (1, 8041460, 8433098) + f4 >>>> 0806820d rb_call0 (80c8bb0, 8433098, fd1, fd1, 1, 8041460) + 9a9 >>>> 08068dd2 rb_call (80c8bb0, 8433098, fd1, 1, 8041460, 0) + 196 >>>> 08062c99 rb_eval (82ce4b4, 8360abc) + 1e8d >>>> 0806344a rb_eval (82ce4b4, 83612dc) + 263e >>>> 080686ca rb_call0 (82cda78, 82ce4b4, 6351, 6351, 0, 0) + e66 >>>> 08068dd2 rb_call (82cda78, 82ce4b4, 6351, 0, 0, 2) + 196 >>>> 080627f9 rb_eval (82ce4b4, 8361534) + 19ed >>>> 080686ca rb_call0 (82cda78, 82ce4b4, 4ecf, 4ecf, 0, 0) + e66 >>>> 08068dd2 rb_call (82cda78, 82ce4b4, 4ecf, 0, 0, 0) + 196 >>>> 08062c99 rb_eval (80c78b4, 83ec9a4) + 1e8d >>>> 0805dfeb eval_node (80c78b4, 83ec9a4) + 3f >>>> 0806a819 rb_load (83ecbd4, 0) + 391 >>>> 0806aaf4 rb_f_load (1, 80469e0, 80c78b4) + 48 >>>> 0806820d rb_call0 (80c8bb0, 80c78b4, 25c1, 25c1, 1, 80469e0) + 9a9 >>>> 08068dd2 rb_call (80c8bb0, 80c78b4, 25c1, 1, 80469e0, 1) + 196 >>>> 08062a39 rb_eval (80c78b4, 80b761c) + 1c2d >>>> 0805dfeb eval_node (80c78b4, 80b761c) + 3f >>>> 0805e6d6 ruby_exec_internal (808a254, 80c8c50, 0, 0, 0, a00000) + d6 >>>> 0805e74b ruby_exec (808a254, 8058cbe, 8047d4c, 8058cc6, feffb7e4, >>>> feffde70) >>>> + 27 >>>> 0805e773 ruby_run (feffb7e4, feffde70, 0, 8047d4c, 8076d8d, feffb7e4) + >>>> 23 >>>> 08058cc6 main (5, 8047d74, 8047d8c) + 3a >>>> 08058bfe _start (5, 8047e30, 8047e47, 8047e67, 8047e6a, 8047e75) + 7a >>>> >>>> _______________________________________________ >>>> Eventmachine-talk mailing list >>>> Eventmachine-talk at rubyforge.org >>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>> >>>> >>> >>> _______________________________________________ >>> Eventmachine-talk mailing list >>> Eventmachine-talk at rubyforge.org >>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>> >> >> _______________________________________________ >> Eventmachine-talk mailing list >> Eventmachine-talk at rubyforge.org >> http://rubyforge.org/mailman/listinfo/eventmachine-talk >> >
Amanda waite
2009-Jan-21 10:57 UTC
[Eventmachine-talk] thin occasionally aborts in eventmachine
Thanks, I''ve been testing with what you gave me yesterday and some logging to indicate when the problem condition occurred and was only just able to make it occur. It didn''t crash but I need to examine the logs and see if anything else went wrong. I''ll give this patch a test too. Thanks for your help Amanda Aman Gupta wrote:> I think I might have found the issue. Can you confirm if this fixes the failure: > > diff --git a/ext/ed.cpp b/ext/ed.cpp > index 27e2d11..1f82443 100644 > --- a/ext/ed.cpp > +++ b/ext/ed.cpp > @@ -624,7 +624,7 @@ void ConnectionDescriptor::_WriteOutboundData() > assert (GetSocket() != INVALID_SOCKET); > int bytes_written = send (GetSocket(), output_buffer, nbytes, 0); > > - if (bytes_written > 0) { > + if (bytes_written >= 0) { > OutboundDataSize -= bytes_written; > if ((size_t)bytes_written < nbytes) { > int len = nbytes - bytes_written; > > Aman > > On Tue, Jan 20, 2009 at 6:11 PM, Aman Gupta <themastermind1 at gmail.com> wrote: > >>>> Can you check what the value of bConnectPending is when the error >>>> occurs? >>>> >>> bConnectPending is false when the error occurs, both from running dbx on the >>> core file and from >>> some logging I added. bNotifyWritable is also false, if that helps. >>> >>> ed->Write was called from: >>> >>> if (FD_ISSET (sd, &(SelectData.fdwrites))) >>> ed->Write(); >>> >>> Which is probably obvious, but I can''t see in this code path where a check >>> is made for >>> OutboundPages.size()>0 other than just before the assert that causes the >>> abort. >>> >> Before calling select, it adds the fd to the fdset if SelectForWrite() >> is true (em.cpp:748): >> >> if (ed->SelectForWrite()) >> FD_SET (sd, &(SelectData.fdwrites)); >> >> and in SelectForWrite, it checks for bConnectPending and Outbound >> pages (ed.cpp:373): >> >> if (bConnectPending || bNotifyWritable) >> return true; >> else { >> return (GetOutboundDataSize() > 0); >> } >> >> I''m not quite sure why this function uses GetOutboundDataSize() >> instead of OutboundPages.size(). Interestingly, >> DatagramDescriptor::SelectForWrite() uses OutboundPages.size() >> instead: >> >> //return (GetOutboundDataSize() > 0); (Original) >> return (OutboundPages.size() > 0); >> >> So presumably a similar issue was found in the UDP code and fixed. >> >> GetOutboundDataSize() returns OutboundDataSize, so it sounds like >> OutboundDataSize and OutboundPages.size() are getting out of sync >> somewhere. Can you try changing SelectForWrite to return >> (OutboundPages.size() > 0)? >> >> Aman >> >> >>> >>>> You could also just put in a check at the beginning of >>>> _WriteOutboundData(): >>>> >>>> if (OutboundPages.size() == 0) return; >>>> >>>> >>> I''ll try this and will add some logging to show when it happens as it may >>> not be obvious. >>> >>> Thanks to everyone for their advice so far. >>> >>> Amanda >>> >>>> Aman >>>> >>>> On Tue, Jan 20, 2009 at 7:22 AM, Amanda Waite <Amanda.Waite at sun.com> >>>> wrote: >>>> >>>> >>>>> I''ve been testing with Thin 1.0.0 on OpenSolaris 2008.11. Every now and >>>>> again and particularly under load, a thin will abort in eventmachine and >>>>> will drop core. The error is caused by an assertion that in a call to >>>>> ed->Write() there should be some data to write and there isn''t. The stack >>>>> trace is at the end of this message. I''ve only been able to recreate this >>>>> with a single thin, with about 35-40 concurrent users. The App is a Rails >>>>> v2.1.0 application (http://incubator.apache.org/olio/). >>>>> >>>>> The error is: >>>>> >>>>> Assertion failed: nbytes > 0, file ed.cpp, line 622 >>>>> Abort (core dumped) >>>>> >>>>> At the point that the failure happens, on entering >>>>> ConnectionDescriptor::_WriteOutboundData(), OutboundPages.size() = 0, >>>>> although in the only test where it aborted while I was monitoring it, >>>>> OutboundDataSize was 16k >>>>> >>>>> The gem is built with g++ and I''ve tried turning down the optimisation >>>>> from >>>>> -O3 to none but it made no difference. >>>>> >>>>> Any ideas or suggestion on how to dig deeper into the problem? >>>>> >>>>> Thanks >>>>> >>>>> Amanda, >>>>> >>>>> Stack Trace: >>>>> >>>>> >>>>> root at dn21:/export/faban/olio_rails/olioapp# pstack core >>>>> core ''core'' of 15660: /usr/ruby/1.8/bin/ruby >>>>> /var/ruby/1.8/gem_home/bin/thin -e production s >>>>> fecb17c5 _lwp_kill (1, 6, 8034820, fec5a02e) + 15 >>>>> fec5a03a raise (6, 0, 8034870, fec315ea) + 22 >>>>> fec3160a abort (65737341, 6f697472, 6166206e, 64656c69, 626e203a, >>>>> 73657479) + f2 >>>>> fec3185a _assert (fea9225d, fea92220, 26e, fea830d6) + 82 >>>>> fea832bf _ZN20ConnectionDescriptor18_WriteOutboundDataEv (a1abc40, >>>>> 8038b18, >>>>> 8038bdc, fea82fd1) + 1f9 >>>>> fea830b8 _ZN20ConnectionDescriptor5WriteEv (a1abc40, 28, 8038b60, >>>>> fea89597) >>>>> + f4 >>>>> fea89869 _ZN14EventMachine_t14_RunSelectOnceEv (9aa48c8, 85106, fecb0615, >>>>> fea89254) + 2e1 >>>>> fea892a4 _ZN14EventMachine_t8_RunOnceEv (9aa48c8, 8039254, 0, fea891d8) + >>>>> 5c >>>>> fea8922f _ZN14EventMachine_t3RunEv (9aa48c8, 808a254, 2, fea72745) + 63 >>>>> fea7283e evma_run_machine (9639760, 808a254, 80391fc, 80681fc, 83b15c0, >>>>> 808a254) + 106 >>>>> fea7796c _Z29t_run_machine_without_threadsm (83b15c0) + 18 >>>>> 080681fc rb_call0 (83b1480, 83b15c0, 5161, 5161, 0, 0) + 998 >>>>> 08068dd2 rb_call (83b1480, 83b15c0, 5161, 0, 0, 2) + 196 >>>>> 080627f9 rb_eval (83b15c0, 83c71b8) + 19ed >>>>> 08063241 rb_eval (83b15c0, 83c7654) + 2435 >>>>> 080686ca rb_call0 (83b1480, 83b15c0, 1411, 1411, 0, 0) + e66 >>>>> 08068dd2 rb_call (83b1480, 83b15c0, 1411, 0, 0, 0) + 196 >>>>> 08062c99 rb_eval (842937c, 8426f78) + 1e8d >>>>> 0806ea95 block_pass (842937c, 8426fa0) + 3a1 >>>>> 080636c8 rb_eval (842937c, 8427158) + 28bc >>>>> 080686ca rb_call0 (84249e4, 842937c, 13c1, 13c1, 0, 0) + e66 >>>>> 08068dd2 rb_call (84249e4, 842937c, 13c1, 0, 0, 0) + 196 >>>>> 08062c99 rb_eval (84297f0, 82c1c28) + 1e8d >>>>> 080686ca rb_call0 (82b9ba4, 84297f0, 13c1, 13c1, 0, 0) + e66 >>>>> 08068dd2 rb_call (82b9ba4, 84297f0, 13c1, 0, 0, 0) + 196 >>>>> 08062c99 rb_eval (8433098, 842fde4) + 1e8d >>>>> 080686ca rb_call0 (8429840, 8433098, 13c1, 13c1, 0, 8041464) + e66 >>>>> 08068dd2 rb_call (8429840, 8433098, 13c1, 0, 8041464, 1) + 196 >>>>> 08068fbc rb_f_send (1, 8041460, 8433098) + f4 >>>>> 0806820d rb_call0 (80c8bb0, 8433098, fd1, fd1, 1, 8041460) + 9a9 >>>>> 08068dd2 rb_call (80c8bb0, 8433098, fd1, 1, 8041460, 0) + 196 >>>>> 08062c99 rb_eval (82ce4b4, 8360abc) + 1e8d >>>>> 0806344a rb_eval (82ce4b4, 83612dc) + 263e >>>>> 080686ca rb_call0 (82cda78, 82ce4b4, 6351, 6351, 0, 0) + e66 >>>>> 08068dd2 rb_call (82cda78, 82ce4b4, 6351, 0, 0, 2) + 196 >>>>> 080627f9 rb_eval (82ce4b4, 8361534) + 19ed >>>>> 080686ca rb_call0 (82cda78, 82ce4b4, 4ecf, 4ecf, 0, 0) + e66 >>>>> 08068dd2 rb_call (82cda78, 82ce4b4, 4ecf, 0, 0, 0) + 196 >>>>> 08062c99 rb_eval (80c78b4, 83ec9a4) + 1e8d >>>>> 0805dfeb eval_node (80c78b4, 83ec9a4) + 3f >>>>> 0806a819 rb_load (83ecbd4, 0) + 391 >>>>> 0806aaf4 rb_f_load (1, 80469e0, 80c78b4) + 48 >>>>> 0806820d rb_call0 (80c8bb0, 80c78b4, 25c1, 25c1, 1, 80469e0) + 9a9 >>>>> 08068dd2 rb_call (80c8bb0, 80c78b4, 25c1, 1, 80469e0, 1) + 196 >>>>> 08062a39 rb_eval (80c78b4, 80b761c) + 1c2d >>>>> 0805dfeb eval_node (80c78b4, 80b761c) + 3f >>>>> 0805e6d6 ruby_exec_internal (808a254, 80c8c50, 0, 0, 0, a00000) + d6 >>>>> 0805e74b ruby_exec (808a254, 8058cbe, 8047d4c, 8058cc6, feffb7e4, >>>>> feffde70) >>>>> + 27 >>>>> 0805e773 ruby_run (feffb7e4, feffde70, 0, 8047d4c, 8076d8d, feffb7e4) + >>>>> 23 >>>>> 08058cc6 main (5, 8047d74, 8047d8c) + 3a >>>>> 08058bfe _start (5, 8047e30, 8047e47, 8047e67, 8047e6a, 8047e75) + 7a >>>>> >>>>> _______________________________________________ >>>>> Eventmachine-talk mailing list >>>>> Eventmachine-talk at rubyforge.org >>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>> >>>>> >>>>> >>>> _______________________________________________ >>>> Eventmachine-talk mailing list >>>> Eventmachine-talk at rubyforge.org >>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>> >>>> >>> _______________________________________________ >>> Eventmachine-talk mailing list >>> Eventmachine-talk at rubyforge.org >>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>> >>> > _______________________________________________ > Eventmachine-talk mailing list > Eventmachine-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/eventmachine-talk >
Aman Gupta
2009-Jan-21 11:42 UTC
[Eventmachine-talk] thin occasionally aborts in eventmachine on
Francis and I just spent some time investigating the issue and I think we''ve finally solved it. This should also fix the long-standing problem in the kqueue reactor. Basically, the problem arises when send() fails with an EWOULDBLOCK or EINPROGRESS. The data is not sent, but its also not put back into OutboundPages, which causes OutboundDataSize to get out of sync and triggers the assertion failure. The fix is to always put unsent data back into the outbound buffer: ==================================================================--- ed.cpp (revision 674) +++ ed.cpp (working copy) @@ -609,31 +609,38 @@ assert (GetSocket() != INVALID_SOCKET); int bytes_written = send (GetSocket(), output_buffer, nbytes, 0); - if (bytes_written > 0) { - OutboundDataSize -= bytes_written; - if ((size_t)bytes_written < nbytes) { - int len = nbytes - bytes_written; - char *buffer = (char*) malloc (len + 1); - if (!buffer) - throw std::runtime_error ("bad alloc throwing back data"); - memcpy (buffer, output_buffer + bytes_written, len); - buffer [len] = 0; - OutboundPages.push_front (OutboundPage (buffer, len)); - } + bool err = false; + if (bytes_written < 0) { + err = true; + bytes_written = 0; + } - #ifdef HAVE_EPOLL - EpollEvent.events = (EPOLLIN | (SelectForWrite() ? EPOLLOUT : 0)); - assert (MyEventMachine); - MyEventMachine->Modify (this); - #endif - #ifdef HAVE_KQUEUE - if (SelectForWrite()) { - MyEventMachine->ArmKqueueWriter (this); - cerr << "POW\n"; - } - #endif + assert (bytes_written >= 0); + OutboundDataSize -= bytes_written; + if ((size_t)bytes_written < nbytes) { + int len = nbytes - bytes_written; + char *buffer = (char*) malloc (len + 1); + if (!buffer) + throw std::runtime_error ("bad alloc throwing back data"); + memcpy (buffer, output_buffer + bytes_written, len); + buffer [len] = 0; + OutboundPages.push_front (OutboundPage (buffer, len)); } - else { + + + #ifdef HAVE_EPOLL + EpollEvent.events = (EPOLLIN | (SelectForWrite() ? EPOLLOUT : 0)); + assert (MyEventMachine); + MyEventMachine->Modify (this); + #endif + #ifdef HAVE_KQUEUE + if (SelectForWrite()) { + MyEventMachine->ArmKqueueWriter (this); + } + #endif + + + if (err) { #ifdef OS_UNIX if ((errno != EINPROGRESS) && (errno != EWOULDBLOCK) && (errno != EINTR)) #endif I''m going to test this patch further and commit it to the git repo later tonight. Aman On Wed, Jan 21, 2009 at 2:57 PM, Amanda waite <Amanda.Waite at sun.com> wrote:> Thanks, I''ve been testing with what you gave me yesterday and some logging > to indicate when the problem condition occurred and was only just able to > make it occur. It didn''t crash but I need to examine the logs and see if > anything else went wrong. I''ll give this patch a test too. > > Thanks for your help > > Amanda > > Aman Gupta wrote: >> >> I think I might have found the issue. Can you confirm if this fixes the >> failure: >> >> diff --git a/ext/ed.cpp b/ext/ed.cpp >> index 27e2d11..1f82443 100644 >> --- a/ext/ed.cpp >> +++ b/ext/ed.cpp >> @@ -624,7 +624,7 @@ void ConnectionDescriptor::_WriteOutboundData() >> assert (GetSocket() != INVALID_SOCKET); >> int bytes_written = send (GetSocket(), output_buffer, nbytes, 0); >> >> - if (bytes_written > 0) { >> + if (bytes_written >= 0) { >> OutboundDataSize -= bytes_written; >> if ((size_t)bytes_written < nbytes) { >> int len = nbytes - bytes_written; >> >> Aman >> >> On Tue, Jan 20, 2009 at 6:11 PM, Aman Gupta <themastermind1 at gmail.com> >> wrote: >> >>>>> >>>>> Can you check what the value of bConnectPending is when the error >>>>> occurs? >>>>> >>>> >>>> bConnectPending is false when the error occurs, both from running dbx on >>>> the >>>> core file and from >>>> some logging I added. bNotifyWritable is also false, if that helps. >>>> >>>> ed->Write was called from: >>>> >>>> if (FD_ISSET (sd, &(SelectData.fdwrites))) >>>> ed->Write(); >>>> >>>> Which is probably obvious, but I can''t see in this code path where a >>>> check >>>> is made for >>>> OutboundPages.size()>0 other than just before the assert that causes the >>>> abort. >>>> >>> >>> Before calling select, it adds the fd to the fdset if SelectForWrite() >>> is true (em.cpp:748): >>> >>> if (ed->SelectForWrite()) >>> FD_SET (sd, &(SelectData.fdwrites)); >>> >>> and in SelectForWrite, it checks for bConnectPending and Outbound >>> pages (ed.cpp:373): >>> >>> if (bConnectPending || bNotifyWritable) >>> return true; >>> else { >>> return (GetOutboundDataSize() > 0); >>> } >>> >>> I''m not quite sure why this function uses GetOutboundDataSize() >>> instead of OutboundPages.size(). Interestingly, >>> DatagramDescriptor::SelectForWrite() uses OutboundPages.size() >>> instead: >>> >>> //return (GetOutboundDataSize() > 0); (Original) >>> return (OutboundPages.size() > 0); >>> >>> So presumably a similar issue was found in the UDP code and fixed. >>> >>> GetOutboundDataSize() returns OutboundDataSize, so it sounds like >>> OutboundDataSize and OutboundPages.size() are getting out of sync >>> somewhere. Can you try changing SelectForWrite to return >>> (OutboundPages.size() > 0)? >>> >>> Aman >>> >>> >>>> >>>> >>>>> >>>>> You could also just put in a check at the beginning of >>>>> _WriteOutboundData(): >>>>> >>>>> if (OutboundPages.size() == 0) return; >>>>> >>>>> >>>> >>>> I''ll try this and will add some logging to show when it happens as it >>>> may >>>> not be obvious. >>>> >>>> Thanks to everyone for their advice so far. >>>> >>>> Amanda >>>> >>>>> >>>>> Aman >>>>> >>>>> On Tue, Jan 20, 2009 at 7:22 AM, Amanda Waite <Amanda.Waite at sun.com> >>>>> wrote: >>>>> >>>>> >>>>>> >>>>>> I''ve been testing with Thin 1.0.0 on OpenSolaris 2008.11. Every now >>>>>> and >>>>>> again and particularly under load, a thin will abort in eventmachine >>>>>> and >>>>>> will drop core. The error is caused by an assertion that in a call to >>>>>> ed->Write() there should be some data to write and there isn''t. The >>>>>> stack >>>>>> trace is at the end of this message. I''ve only been able to recreate >>>>>> this >>>>>> with a single thin, with about 35-40 concurrent users. The App is a >>>>>> Rails >>>>>> v2.1.0 application (http://incubator.apache.org/olio/). >>>>>> >>>>>> The error is: >>>>>> >>>>>> Assertion failed: nbytes > 0, file ed.cpp, line 622 >>>>>> Abort (core dumped) >>>>>> >>>>>> At the point that the failure happens, on entering >>>>>> ConnectionDescriptor::_WriteOutboundData(), OutboundPages.size() = 0, >>>>>> although in the only test where it aborted while I was monitoring it, >>>>>> OutboundDataSize was 16k >>>>>> >>>>>> The gem is built with g++ and I''ve tried turning down the optimisation >>>>>> from >>>>>> -O3 to none but it made no difference. >>>>>> >>>>>> Any ideas or suggestion on how to dig deeper into the problem? >>>>>> >>>>>> Thanks >>>>>> >>>>>> Amanda, >>>>>> >>>>>> Stack Trace: >>>>>> >>>>>> >>>>>> root at dn21:/export/faban/olio_rails/olioapp# pstack core >>>>>> core ''core'' of 15660: /usr/ruby/1.8/bin/ruby >>>>>> /var/ruby/1.8/gem_home/bin/thin -e production s >>>>>> fecb17c5 _lwp_kill (1, 6, 8034820, fec5a02e) + 15 >>>>>> fec5a03a raise (6, 0, 8034870, fec315ea) + 22 >>>>>> fec3160a abort (65737341, 6f697472, 6166206e, 64656c69, 626e203a, >>>>>> 73657479) + f2 >>>>>> fec3185a _assert (fea9225d, fea92220, 26e, fea830d6) + 82 >>>>>> fea832bf _ZN20ConnectionDescriptor18_WriteOutboundDataEv (a1abc40, >>>>>> 8038b18, >>>>>> 8038bdc, fea82fd1) + 1f9 >>>>>> fea830b8 _ZN20ConnectionDescriptor5WriteEv (a1abc40, 28, 8038b60, >>>>>> fea89597) >>>>>> + f4 >>>>>> fea89869 _ZN14EventMachine_t14_RunSelectOnceEv (9aa48c8, 85106, >>>>>> fecb0615, >>>>>> fea89254) + 2e1 >>>>>> fea892a4 _ZN14EventMachine_t8_RunOnceEv (9aa48c8, 8039254, 0, >>>>>> fea891d8) + >>>>>> 5c >>>>>> fea8922f _ZN14EventMachine_t3RunEv (9aa48c8, 808a254, 2, fea72745) + >>>>>> 63 >>>>>> fea7283e evma_run_machine (9639760, 808a254, 80391fc, 80681fc, >>>>>> 83b15c0, >>>>>> 808a254) + 106 >>>>>> fea7796c _Z29t_run_machine_without_threadsm (83b15c0) + 18 >>>>>> 080681fc rb_call0 (83b1480, 83b15c0, 5161, 5161, 0, 0) + 998 >>>>>> 08068dd2 rb_call (83b1480, 83b15c0, 5161, 0, 0, 2) + 196 >>>>>> 080627f9 rb_eval (83b15c0, 83c71b8) + 19ed >>>>>> 08063241 rb_eval (83b15c0, 83c7654) + 2435 >>>>>> 080686ca rb_call0 (83b1480, 83b15c0, 1411, 1411, 0, 0) + e66 >>>>>> 08068dd2 rb_call (83b1480, 83b15c0, 1411, 0, 0, 0) + 196 >>>>>> 08062c99 rb_eval (842937c, 8426f78) + 1e8d >>>>>> 0806ea95 block_pass (842937c, 8426fa0) + 3a1 >>>>>> 080636c8 rb_eval (842937c, 8427158) + 28bc >>>>>> 080686ca rb_call0 (84249e4, 842937c, 13c1, 13c1, 0, 0) + e66 >>>>>> 08068dd2 rb_call (84249e4, 842937c, 13c1, 0, 0, 0) + 196 >>>>>> 08062c99 rb_eval (84297f0, 82c1c28) + 1e8d >>>>>> 080686ca rb_call0 (82b9ba4, 84297f0, 13c1, 13c1, 0, 0) + e66 >>>>>> 08068dd2 rb_call (82b9ba4, 84297f0, 13c1, 0, 0, 0) + 196 >>>>>> 08062c99 rb_eval (8433098, 842fde4) + 1e8d >>>>>> 080686ca rb_call0 (8429840, 8433098, 13c1, 13c1, 0, 8041464) + e66 >>>>>> 08068dd2 rb_call (8429840, 8433098, 13c1, 0, 8041464, 1) + 196 >>>>>> 08068fbc rb_f_send (1, 8041460, 8433098) + f4 >>>>>> 0806820d rb_call0 (80c8bb0, 8433098, fd1, fd1, 1, 8041460) + 9a9 >>>>>> 08068dd2 rb_call (80c8bb0, 8433098, fd1, 1, 8041460, 0) + 196 >>>>>> 08062c99 rb_eval (82ce4b4, 8360abc) + 1e8d >>>>>> 0806344a rb_eval (82ce4b4, 83612dc) + 263e >>>>>> 080686ca rb_call0 (82cda78, 82ce4b4, 6351, 6351, 0, 0) + e66 >>>>>> 08068dd2 rb_call (82cda78, 82ce4b4, 6351, 0, 0, 2) + 196 >>>>>> 080627f9 rb_eval (82ce4b4, 8361534) + 19ed >>>>>> 080686ca rb_call0 (82cda78, 82ce4b4, 4ecf, 4ecf, 0, 0) + e66 >>>>>> 08068dd2 rb_call (82cda78, 82ce4b4, 4ecf, 0, 0, 0) + 196 >>>>>> 08062c99 rb_eval (80c78b4, 83ec9a4) + 1e8d >>>>>> 0805dfeb eval_node (80c78b4, 83ec9a4) + 3f >>>>>> 0806a819 rb_load (83ecbd4, 0) + 391 >>>>>> 0806aaf4 rb_f_load (1, 80469e0, 80c78b4) + 48 >>>>>> 0806820d rb_call0 (80c8bb0, 80c78b4, 25c1, 25c1, 1, 80469e0) + 9a9 >>>>>> 08068dd2 rb_call (80c8bb0, 80c78b4, 25c1, 1, 80469e0, 1) + 196 >>>>>> 08062a39 rb_eval (80c78b4, 80b761c) + 1c2d >>>>>> 0805dfeb eval_node (80c78b4, 80b761c) + 3f >>>>>> 0805e6d6 ruby_exec_internal (808a254, 80c8c50, 0, 0, 0, a00000) + d6 >>>>>> 0805e74b ruby_exec (808a254, 8058cbe, 8047d4c, 8058cc6, feffb7e4, >>>>>> feffde70) >>>>>> + 27 >>>>>> 0805e773 ruby_run (feffb7e4, feffde70, 0, 8047d4c, 8076d8d, feffb7e4) >>>>>> + >>>>>> 23 >>>>>> 08058cc6 main (5, 8047d74, 8047d8c) + 3a >>>>>> 08058bfe _start (5, 8047e30, 8047e47, 8047e67, 8047e6a, 8047e75) + >>>>>> 7a >>>>>> >>>>>> _______________________________________________ >>>>>> Eventmachine-talk mailing list >>>>>> Eventmachine-talk at rubyforge.org >>>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>>> >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Eventmachine-talk mailing list >>>>> Eventmachine-talk at rubyforge.org >>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> Eventmachine-talk mailing list >>>> Eventmachine-talk at rubyforge.org >>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>> >>>> >> >> _______________________________________________ >> Eventmachine-talk mailing list >> Eventmachine-talk at rubyforge.org >> http://rubyforge.org/mailman/listinfo/eventmachine-talk >> > > _______________________________________________ > Eventmachine-talk mailing list > Eventmachine-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/eventmachine-talk >
Amanda Waite
2009-Jan-22 05:00 UTC
[Eventmachine-talk] thin occasionally aborts in eventmachine
I''ve patched my system with this and have now run it for 13 10 minute runs without any problems which gives me some confidence that the patch has fixed the issue. I''m going to try running with 200 users and 5 thins for a few runs. Thanks for your support Amanda Aman Gupta wrote:> Francis and I just spent some time investigating the issue and I think > we''ve finally solved it. This should also fix the long-standing > problem in the kqueue reactor. > > Basically, the problem arises when send() fails with an EWOULDBLOCK or > EINPROGRESS. The data is not sent, but its also not put back into > OutboundPages, which causes OutboundDataSize to get out of sync and > triggers the assertion failure. The fix is to always put unsent data > back into the outbound buffer: > > ==================================================================> --- ed.cpp (revision 674) > +++ ed.cpp (working copy) > @@ -609,31 +609,38 @@ > assert (GetSocket() != INVALID_SOCKET); > int bytes_written = send (GetSocket(), output_buffer, nbytes, 0); > > - if (bytes_written > 0) { > - OutboundDataSize -= bytes_written; > - if ((size_t)bytes_written < nbytes) { > - int len = nbytes - bytes_written; > - char *buffer = (char*) malloc (len + 1); > - if (!buffer) > - throw std::runtime_error ("bad alloc > throwing back data"); > - memcpy (buffer, output_buffer + bytes_written, len); > - buffer [len] = 0; > - OutboundPages.push_front (OutboundPage (buffer, len)); > - } > + bool err = false; > + if (bytes_written < 0) { > + err = true; > + bytes_written = 0; > + } > > - #ifdef HAVE_EPOLL > - EpollEvent.events = (EPOLLIN | (SelectForWrite() ? > EPOLLOUT : 0)); > - assert (MyEventMachine); > - MyEventMachine->Modify (this); > - #endif > - #ifdef HAVE_KQUEUE > - if (SelectForWrite()) { > - MyEventMachine->ArmKqueueWriter (this); > - cerr << "POW\n"; > - } > - #endif > + assert (bytes_written >= 0); > + OutboundDataSize -= bytes_written; > + if ((size_t)bytes_written < nbytes) { > + int len = nbytes - bytes_written; > + char *buffer = (char*) malloc (len + 1); > + if (!buffer) > + throw std::runtime_error ("bad alloc throwing > back data"); > + memcpy (buffer, output_buffer + bytes_written, len); > + buffer [len] = 0; > + OutboundPages.push_front (OutboundPage (buffer, len)); > } > - else { > + > + > + #ifdef HAVE_EPOLL > + EpollEvent.events = (EPOLLIN | (SelectForWrite() ? EPOLLOUT : 0)); > + assert (MyEventMachine); > + MyEventMachine->Modify (this); > + #endif > + #ifdef HAVE_KQUEUE > + if (SelectForWrite()) { > + MyEventMachine->ArmKqueueWriter (this); > + } > + #endif > + > + > + if (err) { > #ifdef OS_UNIX > if ((errno != EINPROGRESS) && (errno != EWOULDBLOCK) > && (errno != EINTR)) > #endif > > I''m going to test this patch further and commit it to the git repo > later tonight. > > Aman > > On Wed, Jan 21, 2009 at 2:57 PM, Amanda waite <Amanda.Waite at sun.com> wrote: > >> Thanks, I''ve been testing with what you gave me yesterday and some logging >> to indicate when the problem condition occurred and was only just able to >> make it occur. It didn''t crash but I need to examine the logs and see if >> anything else went wrong. I''ll give this patch a test too. >> >> Thanks for your help >> >> Amanda >> >> Aman Gupta wrote: >> >>> I think I might have found the issue. Can you confirm if this fixes the >>> failure: >>> >>> diff --git a/ext/ed.cpp b/ext/ed.cpp >>> index 27e2d11..1f82443 100644 >>> --- a/ext/ed.cpp >>> +++ b/ext/ed.cpp >>> @@ -624,7 +624,7 @@ void ConnectionDescriptor::_WriteOutboundData() >>> assert (GetSocket() != INVALID_SOCKET); >>> int bytes_written = send (GetSocket(), output_buffer, nbytes, 0); >>> >>> - if (bytes_written > 0) { >>> + if (bytes_written >= 0) { >>> OutboundDataSize -= bytes_written; >>> if ((size_t)bytes_written < nbytes) { >>> int len = nbytes - bytes_written; >>> >>> Aman >>> >>> On Tue, Jan 20, 2009 at 6:11 PM, Aman Gupta <themastermind1 at gmail.com> >>> wrote: >>> >>> >>>>>> Can you check what the value of bConnectPending is when the error >>>>>> occurs? >>>>>> >>>>>> >>>>> bConnectPending is false when the error occurs, both from running dbx on >>>>> the >>>>> core file and from >>>>> some logging I added. bNotifyWritable is also false, if that helps. >>>>> >>>>> ed->Write was called from: >>>>> >>>>> if (FD_ISSET (sd, &(SelectData.fdwrites))) >>>>> ed->Write(); >>>>> >>>>> Which is probably obvious, but I can''t see in this code path where a >>>>> check >>>>> is made for >>>>> OutboundPages.size()>0 other than just before the assert that causes the >>>>> abort. >>>>> >>>>> >>>> Before calling select, it adds the fd to the fdset if SelectForWrite() >>>> is true (em.cpp:748): >>>> >>>> if (ed->SelectForWrite()) >>>> FD_SET (sd, &(SelectData.fdwrites)); >>>> >>>> and in SelectForWrite, it checks for bConnectPending and Outbound >>>> pages (ed.cpp:373): >>>> >>>> if (bConnectPending || bNotifyWritable) >>>> return true; >>>> else { >>>> return (GetOutboundDataSize() > 0); >>>> } >>>> >>>> I''m not quite sure why this function uses GetOutboundDataSize() >>>> instead of OutboundPages.size(). Interestingly, >>>> DatagramDescriptor::SelectForWrite() uses OutboundPages.size() >>>> instead: >>>> >>>> //return (GetOutboundDataSize() > 0); (Original) >>>> return (OutboundPages.size() > 0); >>>> >>>> So presumably a similar issue was found in the UDP code and fixed. >>>> >>>> GetOutboundDataSize() returns OutboundDataSize, so it sounds like >>>> OutboundDataSize and OutboundPages.size() are getting out of sync >>>> somewhere. Can you try changing SelectForWrite to return >>>> (OutboundPages.size() > 0)? >>>> >>>> Aman >>>> >>>> >>>> >>>>> >>>>>> You could also just put in a check at the beginning of >>>>>> _WriteOutboundData(): >>>>>> >>>>>> if (OutboundPages.size() == 0) return; >>>>>> >>>>>> >>>>>> >>>>> I''ll try this and will add some logging to show when it happens as it >>>>> may >>>>> not be obvious. >>>>> >>>>> Thanks to everyone for their advice so far. >>>>> >>>>> Amanda >>>>> >>>>> >>>>>> Aman >>>>>> >>>>>> On Tue, Jan 20, 2009 at 7:22 AM, Amanda Waite <Amanda.Waite at sun.com> >>>>>> wrote: >>>>>> >>>>>> >>>>>> >>>>>>> I''ve been testing with Thin 1.0.0 on OpenSolaris 2008.11. Every now >>>>>>> and >>>>>>> again and particularly under load, a thin will abort in eventmachine >>>>>>> and >>>>>>> will drop core. The error is caused by an assertion that in a call to >>>>>>> ed->Write() there should be some data to write and there isn''t. The >>>>>>> stack >>>>>>> trace is at the end of this message. I''ve only been able to recreate >>>>>>> this >>>>>>> with a single thin, with about 35-40 concurrent users. The App is a >>>>>>> Rails >>>>>>> v2.1.0 application (http://incubator.apache.org/olio/). >>>>>>> >>>>>>> The error is: >>>>>>> >>>>>>> Assertion failed: nbytes > 0, file ed.cpp, line 622 >>>>>>> Abort (core dumped) >>>>>>> >>>>>>> At the point that the failure happens, on entering >>>>>>> ConnectionDescriptor::_WriteOutboundData(), OutboundPages.size() = 0, >>>>>>> although in the only test where it aborted while I was monitoring it, >>>>>>> OutboundDataSize was 16k >>>>>>> >>>>>>> The gem is built with g++ and I''ve tried turning down the optimisation >>>>>>> from >>>>>>> -O3 to none but it made no difference. >>>>>>> >>>>>>> Any ideas or suggestion on how to dig deeper into the problem? >>>>>>> >>>>>>> Thanks >>>>>>> >>>>>>> Amanda, >>>>>>> >>>>>>> Stack Trace: >>>>>>> >>>>>>> >>>>>>> root at dn21:/export/faban/olio_rails/olioapp# pstack core >>>>>>> core ''core'' of 15660: /usr/ruby/1.8/bin/ruby >>>>>>> /var/ruby/1.8/gem_home/bin/thin -e production s >>>>>>> fecb17c5 _lwp_kill (1, 6, 8034820, fec5a02e) + 15 >>>>>>> fec5a03a raise (6, 0, 8034870, fec315ea) + 22 >>>>>>> fec3160a abort (65737341, 6f697472, 6166206e, 64656c69, 626e203a, >>>>>>> 73657479) + f2 >>>>>>> fec3185a _assert (fea9225d, fea92220, 26e, fea830d6) + 82 >>>>>>> fea832bf _ZN20ConnectionDescriptor18_WriteOutboundDataEv (a1abc40, >>>>>>> 8038b18, >>>>>>> 8038bdc, fea82fd1) + 1f9 >>>>>>> fea830b8 _ZN20ConnectionDescriptor5WriteEv (a1abc40, 28, 8038b60, >>>>>>> fea89597) >>>>>>> + f4 >>>>>>> fea89869 _ZN14EventMachine_t14_RunSelectOnceEv (9aa48c8, 85106, >>>>>>> fecb0615, >>>>>>> fea89254) + 2e1 >>>>>>> fea892a4 _ZN14EventMachine_t8_RunOnceEv (9aa48c8, 8039254, 0, >>>>>>> fea891d8) + >>>>>>> 5c >>>>>>> fea8922f _ZN14EventMachine_t3RunEv (9aa48c8, 808a254, 2, fea72745) + >>>>>>> 63 >>>>>>> fea7283e evma_run_machine (9639760, 808a254, 80391fc, 80681fc, >>>>>>> 83b15c0, >>>>>>> 808a254) + 106 >>>>>>> fea7796c _Z29t_run_machine_without_threadsm (83b15c0) + 18 >>>>>>> 080681fc rb_call0 (83b1480, 83b15c0, 5161, 5161, 0, 0) + 998 >>>>>>> 08068dd2 rb_call (83b1480, 83b15c0, 5161, 0, 0, 2) + 196 >>>>>>> 080627f9 rb_eval (83b15c0, 83c71b8) + 19ed >>>>>>> 08063241 rb_eval (83b15c0, 83c7654) + 2435 >>>>>>> 080686ca rb_call0 (83b1480, 83b15c0, 1411, 1411, 0, 0) + e66 >>>>>>> 08068dd2 rb_call (83b1480, 83b15c0, 1411, 0, 0, 0) + 196 >>>>>>> 08062c99 rb_eval (842937c, 8426f78) + 1e8d >>>>>>> 0806ea95 block_pass (842937c, 8426fa0) + 3a1 >>>>>>> 080636c8 rb_eval (842937c, 8427158) + 28bc >>>>>>> 080686ca rb_call0 (84249e4, 842937c, 13c1, 13c1, 0, 0) + e66 >>>>>>> 08068dd2 rb_call (84249e4, 842937c, 13c1, 0, 0, 0) + 196 >>>>>>> 08062c99 rb_eval (84297f0, 82c1c28) + 1e8d >>>>>>> 080686ca rb_call0 (82b9ba4, 84297f0, 13c1, 13c1, 0, 0) + e66 >>>>>>> 08068dd2 rb_call (82b9ba4, 84297f0, 13c1, 0, 0, 0) + 196 >>>>>>> 08062c99 rb_eval (8433098, 842fde4) + 1e8d >>>>>>> 080686ca rb_call0 (8429840, 8433098, 13c1, 13c1, 0, 8041464) + e66 >>>>>>> 08068dd2 rb_call (8429840, 8433098, 13c1, 0, 8041464, 1) + 196 >>>>>>> 08068fbc rb_f_send (1, 8041460, 8433098) + f4 >>>>>>> 0806820d rb_call0 (80c8bb0, 8433098, fd1, fd1, 1, 8041460) + 9a9 >>>>>>> 08068dd2 rb_call (80c8bb0, 8433098, fd1, 1, 8041460, 0) + 196 >>>>>>> 08062c99 rb_eval (82ce4b4, 8360abc) + 1e8d >>>>>>> 0806344a rb_eval (82ce4b4, 83612dc) + 263e >>>>>>> 080686ca rb_call0 (82cda78, 82ce4b4, 6351, 6351, 0, 0) + e66 >>>>>>> 08068dd2 rb_call (82cda78, 82ce4b4, 6351, 0, 0, 2) + 196 >>>>>>> 080627f9 rb_eval (82ce4b4, 8361534) + 19ed >>>>>>> 080686ca rb_call0 (82cda78, 82ce4b4, 4ecf, 4ecf, 0, 0) + e66 >>>>>>> 08068dd2 rb_call (82cda78, 82ce4b4, 4ecf, 0, 0, 0) + 196 >>>>>>> 08062c99 rb_eval (80c78b4, 83ec9a4) + 1e8d >>>>>>> 0805dfeb eval_node (80c78b4, 83ec9a4) + 3f >>>>>>> 0806a819 rb_load (83ecbd4, 0) + 391 >>>>>>> 0806aaf4 rb_f_load (1, 80469e0, 80c78b4) + 48 >>>>>>> 0806820d rb_call0 (80c8bb0, 80c78b4, 25c1, 25c1, 1, 80469e0) + 9a9 >>>>>>> 08068dd2 rb_call (80c8bb0, 80c78b4, 25c1, 1, 80469e0, 1) + 196 >>>>>>> 08062a39 rb_eval (80c78b4, 80b761c) + 1c2d >>>>>>> 0805dfeb eval_node (80c78b4, 80b761c) + 3f >>>>>>> 0805e6d6 ruby_exec_internal (808a254, 80c8c50, 0, 0, 0, a00000) + d6 >>>>>>> 0805e74b ruby_exec (808a254, 8058cbe, 8047d4c, 8058cc6, feffb7e4, >>>>>>> feffde70) >>>>>>> + 27 >>>>>>> 0805e773 ruby_run (feffb7e4, feffde70, 0, 8047d4c, 8076d8d, feffb7e4) >>>>>>> + >>>>>>> 23 >>>>>>> 08058cc6 main (5, 8047d74, 8047d8c) + 3a >>>>>>> 08058bfe _start (5, 8047e30, 8047e47, 8047e67, 8047e6a, 8047e75) + >>>>>>> 7a >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Eventmachine-talk mailing list >>>>>>> Eventmachine-talk at rubyforge.org >>>>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> Eventmachine-talk mailing list >>>>>> Eventmachine-talk at rubyforge.org >>>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>>> >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> Eventmachine-talk mailing list >>>>> Eventmachine-talk at rubyforge.org >>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>> >>>>> >>>>> >>> _______________________________________________ >>> Eventmachine-talk mailing list >>> Eventmachine-talk at rubyforge.org >>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>> >>> >> _______________________________________________ >> Eventmachine-talk mailing list >> Eventmachine-talk at rubyforge.org >> http://rubyforge.org/mailman/listinfo/eventmachine-talk >> >> > _______________________________________________ > Eventmachine-talk mailing list > Eventmachine-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/eventmachine-talk >
Amanda waite
2009-Jan-25 12:31 UTC
[Eventmachine-talk] thin occasionally aborts in eventmachine
With this patch, I''ve now tested with 20 odd runs with up to 8 thins and up to 260 concurrent users and I''ve not had a single crash and better not a still a single error on any operations. Looks like the patch fixes the problem. I''ll carry on testing and will let you know if I see anything odd. Thanks Amanda Aman Gupta wrote:> Francis and I just spent some time investigating the issue and I think > we''ve finally solved it. This should also fix the long-standing > problem in the kqueue reactor. > > Basically, the problem arises when send() fails with an EWOULDBLOCK or > EINPROGRESS. The data is not sent, but its also not put back into > OutboundPages, which causes OutboundDataSize to get out of sync and > triggers the assertion failure. The fix is to always put unsent data > back into the outbound buffer: > > ==================================================================> --- ed.cpp (revision 674) > +++ ed.cpp (working copy) > @@ -609,31 +609,38 @@ > assert (GetSocket() != INVALID_SOCKET); > int bytes_written = send (GetSocket(), output_buffer, nbytes, 0); > > - if (bytes_written > 0) { > - OutboundDataSize -= bytes_written; > - if ((size_t)bytes_written < nbytes) { > - int len = nbytes - bytes_written; > - char *buffer = (char*) malloc (len + 1); > - if (!buffer) > - throw std::runtime_error ("bad alloc > throwing back data"); > - memcpy (buffer, output_buffer + bytes_written, len); > - buffer [len] = 0; > - OutboundPages.push_front (OutboundPage (buffer, len)); > - } > + bool err = false; > + if (bytes_written < 0) { > + err = true; > + bytes_written = 0; > + } > > - #ifdef HAVE_EPOLL > - EpollEvent.events = (EPOLLIN | (SelectForWrite() ? > EPOLLOUT : 0)); > - assert (MyEventMachine); > - MyEventMachine->Modify (this); > - #endif > - #ifdef HAVE_KQUEUE > - if (SelectForWrite()) { > - MyEventMachine->ArmKqueueWriter (this); > - cerr << "POW\n"; > - } > - #endif > + assert (bytes_written >= 0); > + OutboundDataSize -= bytes_written; > + if ((size_t)bytes_written < nbytes) { > + int len = nbytes - bytes_written; > + char *buffer = (char*) malloc (len + 1); > + if (!buffer) > + throw std::runtime_error ("bad alloc throwing > back data"); > + memcpy (buffer, output_buffer + bytes_written, len); > + buffer [len] = 0; > + OutboundPages.push_front (OutboundPage (buffer, len)); > } > - else { > + > + > + #ifdef HAVE_EPOLL > + EpollEvent.events = (EPOLLIN | (SelectForWrite() ? EPOLLOUT : 0)); > + assert (MyEventMachine); > + MyEventMachine->Modify (this); > + #endif > + #ifdef HAVE_KQUEUE > + if (SelectForWrite()) { > + MyEventMachine->ArmKqueueWriter (this); > + } > + #endif > + > + > + if (err) { > #ifdef OS_UNIX > if ((errno != EINPROGRESS) && (errno != EWOULDBLOCK) > && (errno != EINTR)) > #endif > > I''m going to test this patch further and commit it to the git repo > later tonight. > > Aman > > On Wed, Jan 21, 2009 at 2:57 PM, Amanda waite <Amanda.Waite at sun.com> wrote: > >> Thanks, I''ve been testing with what you gave me yesterday and some logging >> to indicate when the problem condition occurred and was only just able to >> make it occur. It didn''t crash but I need to examine the logs and see if >> anything else went wrong. I''ll give this patch a test too. >> >> Thanks for your help >> >> Amanda >> >> Aman Gupta wrote: >> >>> I think I might have found the issue. Can you confirm if this fixes the >>> failure: >>> >>> diff --git a/ext/ed.cpp b/ext/ed.cpp >>> index 27e2d11..1f82443 100644 >>> --- a/ext/ed.cpp >>> +++ b/ext/ed.cpp >>> @@ -624,7 +624,7 @@ void ConnectionDescriptor::_WriteOutboundData() >>> assert (GetSocket() != INVALID_SOCKET); >>> int bytes_written = send (GetSocket(), output_buffer, nbytes, 0); >>> >>> - if (bytes_written > 0) { >>> + if (bytes_written >= 0) { >>> OutboundDataSize -= bytes_written; >>> if ((size_t)bytes_written < nbytes) { >>> int len = nbytes - bytes_written; >>> >>> Aman >>> >>> On Tue, Jan 20, 2009 at 6:11 PM, Aman Gupta <themastermind1 at gmail.com> >>> wrote: >>> >>> >>>>>> Can you check what the value of bConnectPending is when the error >>>>>> occurs? >>>>>> >>>>>> >>>>> bConnectPending is false when the error occurs, both from running dbx on >>>>> the >>>>> core file and from >>>>> some logging I added. bNotifyWritable is also false, if that helps. >>>>> >>>>> ed->Write was called from: >>>>> >>>>> if (FD_ISSET (sd, &(SelectData.fdwrites))) >>>>> ed->Write(); >>>>> >>>>> Which is probably obvious, but I can''t see in this code path where a >>>>> check >>>>> is made for >>>>> OutboundPages.size()>0 other than just before the assert that causes the >>>>> abort. >>>>> >>>>> >>>> Before calling select, it adds the fd to the fdset if SelectForWrite() >>>> is true (em.cpp:748): >>>> >>>> if (ed->SelectForWrite()) >>>> FD_SET (sd, &(SelectData.fdwrites)); >>>> >>>> and in SelectForWrite, it checks for bConnectPending and Outbound >>>> pages (ed.cpp:373): >>>> >>>> if (bConnectPending || bNotifyWritable) >>>> return true; >>>> else { >>>> return (GetOutboundDataSize() > 0); >>>> } >>>> >>>> I''m not quite sure why this function uses GetOutboundDataSize() >>>> instead of OutboundPages.size(). Interestingly, >>>> DatagramDescriptor::SelectForWrite() uses OutboundPages.size() >>>> instead: >>>> >>>> //return (GetOutboundDataSize() > 0); (Original) >>>> return (OutboundPages.size() > 0); >>>> >>>> So presumably a similar issue was found in the UDP code and fixed. >>>> >>>> GetOutboundDataSize() returns OutboundDataSize, so it sounds like >>>> OutboundDataSize and OutboundPages.size() are getting out of sync >>>> somewhere. Can you try changing SelectForWrite to return >>>> (OutboundPages.size() > 0)? >>>> >>>> Aman >>>> >>>> >>>> >>>>> >>>>>> You could also just put in a check at the beginning of >>>>>> _WriteOutboundData(): >>>>>> >>>>>> if (OutboundPages.size() == 0) return; >>>>>> >>>>>> >>>>>> >>>>> I''ll try this and will add some logging to show when it happens as it >>>>> may >>>>> not be obvious. >>>>> >>>>> Thanks to everyone for their advice so far. >>>>> >>>>> Amanda >>>>> >>>>> >>>>>> Aman >>>>>> >>>>>> On Tue, Jan 20, 2009 at 7:22 AM, Amanda Waite <Amanda.Waite at sun.com> >>>>>> wrote: >>>>>> >>>>>> >>>>>> >>>>>>> I''ve been testing with Thin 1.0.0 on OpenSolaris 2008.11. Every now >>>>>>> and >>>>>>> again and particularly under load, a thin will abort in eventmachine >>>>>>> and >>>>>>> will drop core. The error is caused by an assertion that in a call to >>>>>>> ed->Write() there should be some data to write and there isn''t. The >>>>>>> stack >>>>>>> trace is at the end of this message. I''ve only been able to recreate >>>>>>> this >>>>>>> with a single thin, with about 35-40 concurrent users. The App is a >>>>>>> Rails >>>>>>> v2.1.0 application (http://incubator.apache.org/olio/). >>>>>>> >>>>>>> The error is: >>>>>>> >>>>>>> Assertion failed: nbytes > 0, file ed.cpp, line 622 >>>>>>> Abort (core dumped) >>>>>>> >>>>>>> At the point that the failure happens, on entering >>>>>>> ConnectionDescriptor::_WriteOutboundData(), OutboundPages.size() = 0, >>>>>>> although in the only test where it aborted while I was monitoring it, >>>>>>> OutboundDataSize was 16k >>>>>>> >>>>>>> The gem is built with g++ and I''ve tried turning down the optimisation >>>>>>> from >>>>>>> -O3 to none but it made no difference. >>>>>>> >>>>>>> Any ideas or suggestion on how to dig deeper into the problem? >>>>>>> >>>>>>> Thanks >>>>>>> >>>>>>> Amanda, >>>>>>> >>>>>>> Stack Trace: >>>>>>> >>>>>>> >>>>>>> root at dn21:/export/faban/olio_rails/olioapp# pstack core >>>>>>> core ''core'' of 15660: /usr/ruby/1.8/bin/ruby >>>>>>> /var/ruby/1.8/gem_home/bin/thin -e production s >>>>>>> fecb17c5 _lwp_kill (1, 6, 8034820, fec5a02e) + 15 >>>>>>> fec5a03a raise (6, 0, 8034870, fec315ea) + 22 >>>>>>> fec3160a abort (65737341, 6f697472, 6166206e, 64656c69, 626e203a, >>>>>>> 73657479) + f2 >>>>>>> fec3185a _assert (fea9225d, fea92220, 26e, fea830d6) + 82 >>>>>>> fea832bf _ZN20ConnectionDescriptor18_WriteOutboundDataEv (a1abc40, >>>>>>> 8038b18, >>>>>>> 8038bdc, fea82fd1) + 1f9 >>>>>>> fea830b8 _ZN20ConnectionDescriptor5WriteEv (a1abc40, 28, 8038b60, >>>>>>> fea89597) >>>>>>> + f4 >>>>>>> fea89869 _ZN14EventMachine_t14_RunSelectOnceEv (9aa48c8, 85106, >>>>>>> fecb0615, >>>>>>> fea89254) + 2e1 >>>>>>> fea892a4 _ZN14EventMachine_t8_RunOnceEv (9aa48c8, 8039254, 0, >>>>>>> fea891d8) + >>>>>>> 5c >>>>>>> fea8922f _ZN14EventMachine_t3RunEv (9aa48c8, 808a254, 2, fea72745) + >>>>>>> 63 >>>>>>> fea7283e evma_run_machine (9639760, 808a254, 80391fc, 80681fc, >>>>>>> 83b15c0, >>>>>>> 808a254) + 106 >>>>>>> fea7796c _Z29t_run_machine_without_threadsm (83b15c0) + 18 >>>>>>> 080681fc rb_call0 (83b1480, 83b15c0, 5161, 5161, 0, 0) + 998 >>>>>>> 08068dd2 rb_call (83b1480, 83b15c0, 5161, 0, 0, 2) + 196 >>>>>>> 080627f9 rb_eval (83b15c0, 83c71b8) + 19ed >>>>>>> 08063241 rb_eval (83b15c0, 83c7654) + 2435 >>>>>>> 080686ca rb_call0 (83b1480, 83b15c0, 1411, 1411, 0, 0) + e66 >>>>>>> 08068dd2 rb_call (83b1480, 83b15c0, 1411, 0, 0, 0) + 196 >>>>>>> 08062c99 rb_eval (842937c, 8426f78) + 1e8d >>>>>>> 0806ea95 block_pass (842937c, 8426fa0) + 3a1 >>>>>>> 080636c8 rb_eval (842937c, 8427158) + 28bc >>>>>>> 080686ca rb_call0 (84249e4, 842937c, 13c1, 13c1, 0, 0) + e66 >>>>>>> 08068dd2 rb_call (84249e4, 842937c, 13c1, 0, 0, 0) + 196 >>>>>>> 08062c99 rb_eval (84297f0, 82c1c28) + 1e8d >>>>>>> 080686ca rb_call0 (82b9ba4, 84297f0, 13c1, 13c1, 0, 0) + e66 >>>>>>> 08068dd2 rb_call (82b9ba4, 84297f0, 13c1, 0, 0, 0) + 196 >>>>>>> 08062c99 rb_eval (8433098, 842fde4) + 1e8d >>>>>>> 080686ca rb_call0 (8429840, 8433098, 13c1, 13c1, 0, 8041464) + e66 >>>>>>> 08068dd2 rb_call (8429840, 8433098, 13c1, 0, 8041464, 1) + 196 >>>>>>> 08068fbc rb_f_send (1, 8041460, 8433098) + f4 >>>>>>> 0806820d rb_call0 (80c8bb0, 8433098, fd1, fd1, 1, 8041460) + 9a9 >>>>>>> 08068dd2 rb_call (80c8bb0, 8433098, fd1, 1, 8041460, 0) + 196 >>>>>>> 08062c99 rb_eval (82ce4b4, 8360abc) + 1e8d >>>>>>> 0806344a rb_eval (82ce4b4, 83612dc) + 263e >>>>>>> 080686ca rb_call0 (82cda78, 82ce4b4, 6351, 6351, 0, 0) + e66 >>>>>>> 08068dd2 rb_call (82cda78, 82ce4b4, 6351, 0, 0, 2) + 196 >>>>>>> 080627f9 rb_eval (82ce4b4, 8361534) + 19ed >>>>>>> 080686ca rb_call0 (82cda78, 82ce4b4, 4ecf, 4ecf, 0, 0) + e66 >>>>>>> 08068dd2 rb_call (82cda78, 82ce4b4, 4ecf, 0, 0, 0) + 196 >>>>>>> 08062c99 rb_eval (80c78b4, 83ec9a4) + 1e8d >>>>>>> 0805dfeb eval_node (80c78b4, 83ec9a4) + 3f >>>>>>> 0806a819 rb_load (83ecbd4, 0) + 391 >>>>>>> 0806aaf4 rb_f_load (1, 80469e0, 80c78b4) + 48 >>>>>>> 0806820d rb_call0 (80c8bb0, 80c78b4, 25c1, 25c1, 1, 80469e0) + 9a9 >>>>>>> 08068dd2 rb_call (80c8bb0, 80c78b4, 25c1, 1, 80469e0, 1) + 196 >>>>>>> 08062a39 rb_eval (80c78b4, 80b761c) + 1c2d >>>>>>> 0805dfeb eval_node (80c78b4, 80b761c) + 3f >>>>>>> 0805e6d6 ruby_exec_internal (808a254, 80c8c50, 0, 0, 0, a00000) + d6 >>>>>>> 0805e74b ruby_exec (808a254, 8058cbe, 8047d4c, 8058cc6, feffb7e4, >>>>>>> feffde70) >>>>>>> + 27 >>>>>>> 0805e773 ruby_run (feffb7e4, feffde70, 0, 8047d4c, 8076d8d, feffb7e4) >>>>>>> + >>>>>>> 23 >>>>>>> 08058cc6 main (5, 8047d74, 8047d8c) + 3a >>>>>>> 08058bfe _start (5, 8047e30, 8047e47, 8047e67, 8047e6a, 8047e75) + >>>>>>> 7a >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Eventmachine-talk mailing list >>>>>>> Eventmachine-talk at rubyforge.org >>>>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> Eventmachine-talk mailing list >>>>>> Eventmachine-talk at rubyforge.org >>>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>>> >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> Eventmachine-talk mailing list >>>>> Eventmachine-talk at rubyforge.org >>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>> >>>>> >>>>> >>> _______________________________________________ >>> Eventmachine-talk mailing list >>> Eventmachine-talk at rubyforge.org >>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>> >>> >> _______________________________________________ >> Eventmachine-talk mailing list >> Eventmachine-talk at rubyforge.org >> http://rubyforge.org/mailman/listinfo/eventmachine-talk >> >> > _______________________________________________ > Eventmachine-talk mailing list > Eventmachine-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/eventmachine-talk >
Aman Gupta
2009-Jan-25 12:58 UTC
[Eventmachine-talk] thin occasionally aborts in eventmachine on
Awesome. This patch was pushed to git and will make it into our next release. Thanks, Aman On Sun, Jan 25, 2009 at 4:31 PM, Amanda waite <Amanda.Waite at sun.com> wrote:> With this patch, I''ve now tested with 20 odd runs with up to 8 thins and up > to 260 concurrent users and I''ve not had a single crash and better not a > still a single error on any operations. > > Looks like the patch fixes the problem. I''ll carry on testing and will let > you know if I see anything odd. > > Thanks > > Amanda > > Aman Gupta wrote: >> >> Francis and I just spent some time investigating the issue and I think >> we''ve finally solved it. This should also fix the long-standing >> problem in the kqueue reactor. >> >> Basically, the problem arises when send() fails with an EWOULDBLOCK or >> EINPROGRESS. The data is not sent, but its also not put back into >> OutboundPages, which causes OutboundDataSize to get out of sync and >> triggers the assertion failure. The fix is to always put unsent data >> back into the outbound buffer: >> >> ==================================================================>> --- ed.cpp (revision 674) >> +++ ed.cpp (working copy) >> @@ -609,31 +609,38 @@ >> assert (GetSocket() != INVALID_SOCKET); >> int bytes_written = send (GetSocket(), output_buffer, nbytes, 0); >> >> - if (bytes_written > 0) { >> - OutboundDataSize -= bytes_written; >> - if ((size_t)bytes_written < nbytes) { >> - int len = nbytes - bytes_written; >> - char *buffer = (char*) malloc (len + 1); >> - if (!buffer) >> - throw std::runtime_error ("bad alloc >> throwing back data"); >> - memcpy (buffer, output_buffer + bytes_written, >> len); >> - buffer [len] = 0; >> - OutboundPages.push_front (OutboundPage (buffer, >> len)); >> - } >> + bool err = false; >> + if (bytes_written < 0) { >> + err = true; >> + bytes_written = 0; >> + } >> >> - #ifdef HAVE_EPOLL >> - EpollEvent.events = (EPOLLIN | (SelectForWrite() ? >> EPOLLOUT : 0)); >> - assert (MyEventMachine); >> - MyEventMachine->Modify (this); >> - #endif >> - #ifdef HAVE_KQUEUE >> - if (SelectForWrite()) { >> - MyEventMachine->ArmKqueueWriter (this); >> - cerr << "POW\n"; >> - } >> - #endif >> + assert (bytes_written >= 0); >> + OutboundDataSize -= bytes_written; >> + if ((size_t)bytes_written < nbytes) { >> + int len = nbytes - bytes_written; >> + char *buffer = (char*) malloc (len + 1); >> + if (!buffer) >> + throw std::runtime_error ("bad alloc throwing >> back data"); >> + memcpy (buffer, output_buffer + bytes_written, len); >> + buffer [len] = 0; >> + OutboundPages.push_front (OutboundPage (buffer, len)); >> } >> - else { >> + >> + >> + #ifdef HAVE_EPOLL >> + EpollEvent.events = (EPOLLIN | (SelectForWrite() ? EPOLLOUT : 0)); >> + assert (MyEventMachine); >> + MyEventMachine->Modify (this); >> + #endif >> + #ifdef HAVE_KQUEUE >> + if (SelectForWrite()) { >> + MyEventMachine->ArmKqueueWriter (this); >> + } >> + #endif >> + >> + >> + if (err) { >> #ifdef OS_UNIX >> if ((errno != EINPROGRESS) && (errno != EWOULDBLOCK) >> && (errno != EINTR)) >> #endif >> >> I''m going to test this patch further and commit it to the git repo >> later tonight. >> >> Aman >> >> On Wed, Jan 21, 2009 at 2:57 PM, Amanda waite <Amanda.Waite at sun.com> >> wrote: >> >>> >>> Thanks, I''ve been testing with what you gave me yesterday and some >>> logging >>> to indicate when the problem condition occurred and was only just able to >>> make it occur. It didn''t crash but I need to examine the logs and see if >>> anything else went wrong. I''ll give this patch a test too. >>> >>> Thanks for your help >>> >>> Amanda >>> >>> Aman Gupta wrote: >>> >>>> >>>> I think I might have found the issue. Can you confirm if this fixes the >>>> failure: >>>> >>>> diff --git a/ext/ed.cpp b/ext/ed.cpp >>>> index 27e2d11..1f82443 100644 >>>> --- a/ext/ed.cpp >>>> +++ b/ext/ed.cpp >>>> @@ -624,7 +624,7 @@ void ConnectionDescriptor::_WriteOutboundData() >>>> assert (GetSocket() != INVALID_SOCKET); >>>> int bytes_written = send (GetSocket(), output_buffer, nbytes, 0); >>>> >>>> - if (bytes_written > 0) { >>>> + if (bytes_written >= 0) { >>>> OutboundDataSize -= bytes_written; >>>> if ((size_t)bytes_written < nbytes) { >>>> int len = nbytes - bytes_written; >>>> >>>> Aman >>>> >>>> On Tue, Jan 20, 2009 at 6:11 PM, Aman Gupta <themastermind1 at gmail.com> >>>> wrote: >>>> >>>> >>>>>>> >>>>>>> Can you check what the value of bConnectPending is when the error >>>>>>> occurs? >>>>>>> >>>>>>> >>>>>> >>>>>> bConnectPending is false when the error occurs, both from running dbx >>>>>> on >>>>>> the >>>>>> core file and from >>>>>> some logging I added. bNotifyWritable is also false, if that helps. >>>>>> >>>>>> ed->Write was called from: >>>>>> >>>>>> if (FD_ISSET (sd, &(SelectData.fdwrites))) >>>>>> ed->Write(); >>>>>> >>>>>> Which is probably obvious, but I can''t see in this code path where a >>>>>> check >>>>>> is made for >>>>>> OutboundPages.size()>0 other than just before the assert that causes >>>>>> the >>>>>> abort. >>>>>> >>>>>> >>>>> >>>>> Before calling select, it adds the fd to the fdset if SelectForWrite() >>>>> is true (em.cpp:748): >>>>> >>>>> if (ed->SelectForWrite()) >>>>> FD_SET (sd, &(SelectData.fdwrites)); >>>>> >>>>> and in SelectForWrite, it checks for bConnectPending and Outbound >>>>> pages (ed.cpp:373): >>>>> >>>>> if (bConnectPending || bNotifyWritable) >>>>> return true; >>>>> else { >>>>> return (GetOutboundDataSize() > 0); >>>>> } >>>>> >>>>> I''m not quite sure why this function uses GetOutboundDataSize() >>>>> instead of OutboundPages.size(). Interestingly, >>>>> DatagramDescriptor::SelectForWrite() uses OutboundPages.size() >>>>> instead: >>>>> >>>>> //return (GetOutboundDataSize() > 0); (Original) >>>>> return (OutboundPages.size() > 0); >>>>> >>>>> So presumably a similar issue was found in the UDP code and fixed. >>>>> >>>>> GetOutboundDataSize() returns OutboundDataSize, so it sounds like >>>>> OutboundDataSize and OutboundPages.size() are getting out of sync >>>>> somewhere. Can you try changing SelectForWrite to return >>>>> (OutboundPages.size() > 0)? >>>>> >>>>> Aman >>>>> >>>>> >>>>> >>>>>> >>>>>> >>>>>>> >>>>>>> You could also just put in a check at the beginning of >>>>>>> _WriteOutboundData(): >>>>>>> >>>>>>> if (OutboundPages.size() == 0) return; >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> I''ll try this and will add some logging to show when it happens as it >>>>>> may >>>>>> not be obvious. >>>>>> >>>>>> Thanks to everyone for their advice so far. >>>>>> >>>>>> Amanda >>>>>> >>>>>> >>>>>>> >>>>>>> Aman >>>>>>> >>>>>>> On Tue, Jan 20, 2009 at 7:22 AM, Amanda Waite <Amanda.Waite at sun.com> >>>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> I''ve been testing with Thin 1.0.0 on OpenSolaris 2008.11. Every now >>>>>>>> and >>>>>>>> again and particularly under load, a thin will abort in eventmachine >>>>>>>> and >>>>>>>> will drop core. The error is caused by an assertion that in a call >>>>>>>> to >>>>>>>> ed->Write() there should be some data to write and there isn''t. The >>>>>>>> stack >>>>>>>> trace is at the end of this message. I''ve only been able to recreate >>>>>>>> this >>>>>>>> with a single thin, with about 35-40 concurrent users. The App is a >>>>>>>> Rails >>>>>>>> v2.1.0 application (http://incubator.apache.org/olio/). >>>>>>>> >>>>>>>> The error is: >>>>>>>> >>>>>>>> Assertion failed: nbytes > 0, file ed.cpp, line 622 >>>>>>>> Abort (core dumped) >>>>>>>> >>>>>>>> At the point that the failure happens, on entering >>>>>>>> ConnectionDescriptor::_WriteOutboundData(), OutboundPages.size() >>>>>>>> 0, >>>>>>>> although in the only test where it aborted while I was monitoring >>>>>>>> it, >>>>>>>> OutboundDataSize was 16k >>>>>>>> >>>>>>>> The gem is built with g++ and I''ve tried turning down the >>>>>>>> optimisation >>>>>>>> from >>>>>>>> -O3 to none but it made no difference. >>>>>>>> >>>>>>>> Any ideas or suggestion on how to dig deeper into the problem? >>>>>>>> >>>>>>>> Thanks >>>>>>>> >>>>>>>> Amanda, >>>>>>>> >>>>>>>> Stack Trace: >>>>>>>> >>>>>>>> >>>>>>>> root at dn21:/export/faban/olio_rails/olioapp# pstack core >>>>>>>> core ''core'' of 15660: /usr/ruby/1.8/bin/ruby >>>>>>>> /var/ruby/1.8/gem_home/bin/thin -e production s >>>>>>>> fecb17c5 _lwp_kill (1, 6, 8034820, fec5a02e) + 15 >>>>>>>> fec5a03a raise (6, 0, 8034870, fec315ea) + 22 >>>>>>>> fec3160a abort (65737341, 6f697472, 6166206e, 64656c69, 626e203a, >>>>>>>> 73657479) + f2 >>>>>>>> fec3185a _assert (fea9225d, fea92220, 26e, fea830d6) + 82 >>>>>>>> fea832bf _ZN20ConnectionDescriptor18_WriteOutboundDataEv (a1abc40, >>>>>>>> 8038b18, >>>>>>>> 8038bdc, fea82fd1) + 1f9 >>>>>>>> fea830b8 _ZN20ConnectionDescriptor5WriteEv (a1abc40, 28, 8038b60, >>>>>>>> fea89597) >>>>>>>> + f4 >>>>>>>> fea89869 _ZN14EventMachine_t14_RunSelectOnceEv (9aa48c8, 85106, >>>>>>>> fecb0615, >>>>>>>> fea89254) + 2e1 >>>>>>>> fea892a4 _ZN14EventMachine_t8_RunOnceEv (9aa48c8, 8039254, 0, >>>>>>>> fea891d8) + >>>>>>>> 5c >>>>>>>> fea8922f _ZN14EventMachine_t3RunEv (9aa48c8, 808a254, 2, fea72745) + >>>>>>>> 63 >>>>>>>> fea7283e evma_run_machine (9639760, 808a254, 80391fc, 80681fc, >>>>>>>> 83b15c0, >>>>>>>> 808a254) + 106 >>>>>>>> fea7796c _Z29t_run_machine_without_threadsm (83b15c0) + 18 >>>>>>>> 080681fc rb_call0 (83b1480, 83b15c0, 5161, 5161, 0, 0) + 998 >>>>>>>> 08068dd2 rb_call (83b1480, 83b15c0, 5161, 0, 0, 2) + 196 >>>>>>>> 080627f9 rb_eval (83b15c0, 83c71b8) + 19ed >>>>>>>> 08063241 rb_eval (83b15c0, 83c7654) + 2435 >>>>>>>> 080686ca rb_call0 (83b1480, 83b15c0, 1411, 1411, 0, 0) + e66 >>>>>>>> 08068dd2 rb_call (83b1480, 83b15c0, 1411, 0, 0, 0) + 196 >>>>>>>> 08062c99 rb_eval (842937c, 8426f78) + 1e8d >>>>>>>> 0806ea95 block_pass (842937c, 8426fa0) + 3a1 >>>>>>>> 080636c8 rb_eval (842937c, 8427158) + 28bc >>>>>>>> 080686ca rb_call0 (84249e4, 842937c, 13c1, 13c1, 0, 0) + e66 >>>>>>>> 08068dd2 rb_call (84249e4, 842937c, 13c1, 0, 0, 0) + 196 >>>>>>>> 08062c99 rb_eval (84297f0, 82c1c28) + 1e8d >>>>>>>> 080686ca rb_call0 (82b9ba4, 84297f0, 13c1, 13c1, 0, 0) + e66 >>>>>>>> 08068dd2 rb_call (82b9ba4, 84297f0, 13c1, 0, 0, 0) + 196 >>>>>>>> 08062c99 rb_eval (8433098, 842fde4) + 1e8d >>>>>>>> 080686ca rb_call0 (8429840, 8433098, 13c1, 13c1, 0, 8041464) + e66 >>>>>>>> 08068dd2 rb_call (8429840, 8433098, 13c1, 0, 8041464, 1) + 196 >>>>>>>> 08068fbc rb_f_send (1, 8041460, 8433098) + f4 >>>>>>>> 0806820d rb_call0 (80c8bb0, 8433098, fd1, fd1, 1, 8041460) + 9a9 >>>>>>>> 08068dd2 rb_call (80c8bb0, 8433098, fd1, 1, 8041460, 0) + 196 >>>>>>>> 08062c99 rb_eval (82ce4b4, 8360abc) + 1e8d >>>>>>>> 0806344a rb_eval (82ce4b4, 83612dc) + 263e >>>>>>>> 080686ca rb_call0 (82cda78, 82ce4b4, 6351, 6351, 0, 0) + e66 >>>>>>>> 08068dd2 rb_call (82cda78, 82ce4b4, 6351, 0, 0, 2) + 196 >>>>>>>> 080627f9 rb_eval (82ce4b4, 8361534) + 19ed >>>>>>>> 080686ca rb_call0 (82cda78, 82ce4b4, 4ecf, 4ecf, 0, 0) + e66 >>>>>>>> 08068dd2 rb_call (82cda78, 82ce4b4, 4ecf, 0, 0, 0) + 196 >>>>>>>> 08062c99 rb_eval (80c78b4, 83ec9a4) + 1e8d >>>>>>>> 0805dfeb eval_node (80c78b4, 83ec9a4) + 3f >>>>>>>> 0806a819 rb_load (83ecbd4, 0) + 391 >>>>>>>> 0806aaf4 rb_f_load (1, 80469e0, 80c78b4) + 48 >>>>>>>> 0806820d rb_call0 (80c8bb0, 80c78b4, 25c1, 25c1, 1, 80469e0) + 9a9 >>>>>>>> 08068dd2 rb_call (80c8bb0, 80c78b4, 25c1, 1, 80469e0, 1) + 196 >>>>>>>> 08062a39 rb_eval (80c78b4, 80b761c) + 1c2d >>>>>>>> 0805dfeb eval_node (80c78b4, 80b761c) + 3f >>>>>>>> 0805e6d6 ruby_exec_internal (808a254, 80c8c50, 0, 0, 0, a00000) + d6 >>>>>>>> 0805e74b ruby_exec (808a254, 8058cbe, 8047d4c, 8058cc6, feffb7e4, >>>>>>>> feffde70) >>>>>>>> + 27 >>>>>>>> 0805e773 ruby_run (feffb7e4, feffde70, 0, 8047d4c, 8076d8d, >>>>>>>> feffb7e4) >>>>>>>> + >>>>>>>> 23 >>>>>>>> 08058cc6 main (5, 8047d74, 8047d8c) + 3a >>>>>>>> 08058bfe _start (5, 8047e30, 8047e47, 8047e67, 8047e6a, 8047e75) + >>>>>>>> 7a >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Eventmachine-talk mailing list >>>>>>>> Eventmachine-talk at rubyforge.org >>>>>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Eventmachine-talk mailing list >>>>>>> Eventmachine-talk at rubyforge.org >>>>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Eventmachine-talk mailing list >>>>>> Eventmachine-talk at rubyforge.org >>>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>>> >>>>>> >>>>>> >>>> >>>> _______________________________________________ >>>> Eventmachine-talk mailing list >>>> Eventmachine-talk at rubyforge.org >>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>> >>>> >>> >>> _______________________________________________ >>> Eventmachine-talk mailing list >>> Eventmachine-talk at rubyforge.org >>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>> >>> >> >> _______________________________________________ >> Eventmachine-talk mailing list >> Eventmachine-talk at rubyforge.org >> http://rubyforge.org/mailman/listinfo/eventmachine-talk >> > > _______________________________________________ > Eventmachine-talk mailing list > Eventmachine-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/eventmachine-talk >
Amanda Waite
2009-Jan-26 01:17 UTC
[Eventmachine-talk] thin occasionally aborts in eventmachine
Thanks for your support and thanks to everyone that replied and offered suggestions. Amanda Aman Gupta wrote:> Awesome. This patch was pushed to git and will make it into our next release. > > Thanks, > Aman > > On Sun, Jan 25, 2009 at 4:31 PM, Amanda waite <Amanda.Waite at sun.com> wrote: > >> With this patch, I''ve now tested with 20 odd runs with up to 8 thins and up >> to 260 concurrent users and I''ve not had a single crash and better not a >> still a single error on any operations. >> >> Looks like the patch fixes the problem. I''ll carry on testing and will let >> you know if I see anything odd. >> >> Thanks >> >> Amanda >> >> Aman Gupta wrote: >> >>> Francis and I just spent some time investigating the issue and I think >>> we''ve finally solved it. This should also fix the long-standing >>> problem in the kqueue reactor. >>> >>> Basically, the problem arises when send() fails with an EWOULDBLOCK or >>> EINPROGRESS. The data is not sent, but its also not put back into >>> OutboundPages, which causes OutboundDataSize to get out of sync and >>> triggers the assertion failure. The fix is to always put unsent data >>> back into the outbound buffer: >>> >>> ==================================================================>>> --- ed.cpp (revision 674) >>> +++ ed.cpp (working copy) >>> @@ -609,31 +609,38 @@ >>> assert (GetSocket() != INVALID_SOCKET); >>> int bytes_written = send (GetSocket(), output_buffer, nbytes, 0); >>> >>> - if (bytes_written > 0) { >>> - OutboundDataSize -= bytes_written; >>> - if ((size_t)bytes_written < nbytes) { >>> - int len = nbytes - bytes_written; >>> - char *buffer = (char*) malloc (len + 1); >>> - if (!buffer) >>> - throw std::runtime_error ("bad alloc >>> throwing back data"); >>> - memcpy (buffer, output_buffer + bytes_written, >>> len); >>> - buffer [len] = 0; >>> - OutboundPages.push_front (OutboundPage (buffer, >>> len)); >>> - } >>> + bool err = false; >>> + if (bytes_written < 0) { >>> + err = true; >>> + bytes_written = 0; >>> + } >>> >>> - #ifdef HAVE_EPOLL >>> - EpollEvent.events = (EPOLLIN | (SelectForWrite() ? >>> EPOLLOUT : 0)); >>> - assert (MyEventMachine); >>> - MyEventMachine->Modify (this); >>> - #endif >>> - #ifdef HAVE_KQUEUE >>> - if (SelectForWrite()) { >>> - MyEventMachine->ArmKqueueWriter (this); >>> - cerr << "POW\n"; >>> - } >>> - #endif >>> + assert (bytes_written >= 0); >>> + OutboundDataSize -= bytes_written; >>> + if ((size_t)bytes_written < nbytes) { >>> + int len = nbytes - bytes_written; >>> + char *buffer = (char*) malloc (len + 1); >>> + if (!buffer) >>> + throw std::runtime_error ("bad alloc throwing >>> back data"); >>> + memcpy (buffer, output_buffer + bytes_written, len); >>> + buffer [len] = 0; >>> + OutboundPages.push_front (OutboundPage (buffer, len)); >>> } >>> - else { >>> + >>> + >>> + #ifdef HAVE_EPOLL >>> + EpollEvent.events = (EPOLLIN | (SelectForWrite() ? EPOLLOUT : 0)); >>> + assert (MyEventMachine); >>> + MyEventMachine->Modify (this); >>> + #endif >>> + #ifdef HAVE_KQUEUE >>> + if (SelectForWrite()) { >>> + MyEventMachine->ArmKqueueWriter (this); >>> + } >>> + #endif >>> + >>> + >>> + if (err) { >>> #ifdef OS_UNIX >>> if ((errno != EINPROGRESS) && (errno != EWOULDBLOCK) >>> && (errno != EINTR)) >>> #endif >>> >>> I''m going to test this patch further and commit it to the git repo >>> later tonight. >>> >>> Aman >>> >>> On Wed, Jan 21, 2009 at 2:57 PM, Amanda waite <Amanda.Waite at sun.com> >>> wrote: >>> >>> >>>> Thanks, I''ve been testing with what you gave me yesterday and some >>>> logging >>>> to indicate when the problem condition occurred and was only just able to >>>> make it occur. It didn''t crash but I need to examine the logs and see if >>>> anything else went wrong. I''ll give this patch a test too. >>>> >>>> Thanks for your help >>>> >>>> Amanda >>>> >>>> Aman Gupta wrote: >>>> >>>> >>>>> I think I might have found the issue. Can you confirm if this fixes the >>>>> failure: >>>>> >>>>> diff --git a/ext/ed.cpp b/ext/ed.cpp >>>>> index 27e2d11..1f82443 100644 >>>>> --- a/ext/ed.cpp >>>>> +++ b/ext/ed.cpp >>>>> @@ -624,7 +624,7 @@ void ConnectionDescriptor::_WriteOutboundData() >>>>> assert (GetSocket() != INVALID_SOCKET); >>>>> int bytes_written = send (GetSocket(), output_buffer, nbytes, 0); >>>>> >>>>> - if (bytes_written > 0) { >>>>> + if (bytes_written >= 0) { >>>>> OutboundDataSize -= bytes_written; >>>>> if ((size_t)bytes_written < nbytes) { >>>>> int len = nbytes - bytes_written; >>>>> >>>>> Aman >>>>> >>>>> On Tue, Jan 20, 2009 at 6:11 PM, Aman Gupta <themastermind1 at gmail.com> >>>>> wrote: >>>>> >>>>> >>>>> >>>>>>>> Can you check what the value of bConnectPending is when the error >>>>>>>> occurs? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> bConnectPending is false when the error occurs, both from running dbx >>>>>>> on >>>>>>> the >>>>>>> core file and from >>>>>>> some logging I added. bNotifyWritable is also false, if that helps. >>>>>>> >>>>>>> ed->Write was called from: >>>>>>> >>>>>>> if (FD_ISSET (sd, &(SelectData.fdwrites))) >>>>>>> ed->Write(); >>>>>>> >>>>>>> Which is probably obvious, but I can''t see in this code path where a >>>>>>> check >>>>>>> is made for >>>>>>> OutboundPages.size()>0 other than just before the assert that causes >>>>>>> the >>>>>>> abort. >>>>>>> >>>>>>> >>>>>>> >>>>>> Before calling select, it adds the fd to the fdset if SelectForWrite() >>>>>> is true (em.cpp:748): >>>>>> >>>>>> if (ed->SelectForWrite()) >>>>>> FD_SET (sd, &(SelectData.fdwrites)); >>>>>> >>>>>> and in SelectForWrite, it checks for bConnectPending and Outbound >>>>>> pages (ed.cpp:373): >>>>>> >>>>>> if (bConnectPending || bNotifyWritable) >>>>>> return true; >>>>>> else { >>>>>> return (GetOutboundDataSize() > 0); >>>>>> } >>>>>> >>>>>> I''m not quite sure why this function uses GetOutboundDataSize() >>>>>> instead of OutboundPages.size(). Interestingly, >>>>>> DatagramDescriptor::SelectForWrite() uses OutboundPages.size() >>>>>> instead: >>>>>> >>>>>> //return (GetOutboundDataSize() > 0); (Original) >>>>>> return (OutboundPages.size() > 0); >>>>>> >>>>>> So presumably a similar issue was found in the UDP code and fixed. >>>>>> >>>>>> GetOutboundDataSize() returns OutboundDataSize, so it sounds like >>>>>> OutboundDataSize and OutboundPages.size() are getting out of sync >>>>>> somewhere. Can you try changing SelectForWrite to return >>>>>> (OutboundPages.size() > 0)? >>>>>> >>>>>> Aman >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> >>>>>>>> You could also just put in a check at the beginning of >>>>>>>> _WriteOutboundData(): >>>>>>>> >>>>>>>> if (OutboundPages.size() == 0) return; >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> I''ll try this and will add some logging to show when it happens as it >>>>>>> may >>>>>>> not be obvious. >>>>>>> >>>>>>> Thanks to everyone for their advice so far. >>>>>>> >>>>>>> Amanda >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Aman >>>>>>>> >>>>>>>> On Tue, Jan 20, 2009 at 7:22 AM, Amanda Waite <Amanda.Waite at sun.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> I''ve been testing with Thin 1.0.0 on OpenSolaris 2008.11. Every now >>>>>>>>> and >>>>>>>>> again and particularly under load, a thin will abort in eventmachine >>>>>>>>> and >>>>>>>>> will drop core. The error is caused by an assertion that in a call >>>>>>>>> to >>>>>>>>> ed->Write() there should be some data to write and there isn''t. The >>>>>>>>> stack >>>>>>>>> trace is at the end of this message. I''ve only been able to recreate >>>>>>>>> this >>>>>>>>> with a single thin, with about 35-40 concurrent users. The App is a >>>>>>>>> Rails >>>>>>>>> v2.1.0 application (http://incubator.apache.org/olio/). >>>>>>>>> >>>>>>>>> The error is: >>>>>>>>> >>>>>>>>> Assertion failed: nbytes > 0, file ed.cpp, line 622 >>>>>>>>> Abort (core dumped) >>>>>>>>> >>>>>>>>> At the point that the failure happens, on entering >>>>>>>>> ConnectionDescriptor::_WriteOutboundData(), OutboundPages.size() >>>>>>>>> 0, >>>>>>>>> although in the only test where it aborted while I was monitoring >>>>>>>>> it, >>>>>>>>> OutboundDataSize was 16k >>>>>>>>> >>>>>>>>> The gem is built with g++ and I''ve tried turning down the >>>>>>>>> optimisation >>>>>>>>> from >>>>>>>>> -O3 to none but it made no difference. >>>>>>>>> >>>>>>>>> Any ideas or suggestion on how to dig deeper into the problem? >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> >>>>>>>>> Amanda, >>>>>>>>> >>>>>>>>> Stack Trace: >>>>>>>>> >>>>>>>>> >>>>>>>>> root at dn21:/export/faban/olio_rails/olioapp# pstack core >>>>>>>>> core ''core'' of 15660: /usr/ruby/1.8/bin/ruby >>>>>>>>> /var/ruby/1.8/gem_home/bin/thin -e production s >>>>>>>>> fecb17c5 _lwp_kill (1, 6, 8034820, fec5a02e) + 15 >>>>>>>>> fec5a03a raise (6, 0, 8034870, fec315ea) + 22 >>>>>>>>> fec3160a abort (65737341, 6f697472, 6166206e, 64656c69, 626e203a, >>>>>>>>> 73657479) + f2 >>>>>>>>> fec3185a _assert (fea9225d, fea92220, 26e, fea830d6) + 82 >>>>>>>>> fea832bf _ZN20ConnectionDescriptor18_WriteOutboundDataEv (a1abc40, >>>>>>>>> 8038b18, >>>>>>>>> 8038bdc, fea82fd1) + 1f9 >>>>>>>>> fea830b8 _ZN20ConnectionDescriptor5WriteEv (a1abc40, 28, 8038b60, >>>>>>>>> fea89597) >>>>>>>>> + f4 >>>>>>>>> fea89869 _ZN14EventMachine_t14_RunSelectOnceEv (9aa48c8, 85106, >>>>>>>>> fecb0615, >>>>>>>>> fea89254) + 2e1 >>>>>>>>> fea892a4 _ZN14EventMachine_t8_RunOnceEv (9aa48c8, 8039254, 0, >>>>>>>>> fea891d8) + >>>>>>>>> 5c >>>>>>>>> fea8922f _ZN14EventMachine_t3RunEv (9aa48c8, 808a254, 2, fea72745) + >>>>>>>>> 63 >>>>>>>>> fea7283e evma_run_machine (9639760, 808a254, 80391fc, 80681fc, >>>>>>>>> 83b15c0, >>>>>>>>> 808a254) + 106 >>>>>>>>> fea7796c _Z29t_run_machine_without_threadsm (83b15c0) + 18 >>>>>>>>> 080681fc rb_call0 (83b1480, 83b15c0, 5161, 5161, 0, 0) + 998 >>>>>>>>> 08068dd2 rb_call (83b1480, 83b15c0, 5161, 0, 0, 2) + 196 >>>>>>>>> 080627f9 rb_eval (83b15c0, 83c71b8) + 19ed >>>>>>>>> 08063241 rb_eval (83b15c0, 83c7654) + 2435 >>>>>>>>> 080686ca rb_call0 (83b1480, 83b15c0, 1411, 1411, 0, 0) + e66 >>>>>>>>> 08068dd2 rb_call (83b1480, 83b15c0, 1411, 0, 0, 0) + 196 >>>>>>>>> 08062c99 rb_eval (842937c, 8426f78) + 1e8d >>>>>>>>> 0806ea95 block_pass (842937c, 8426fa0) + 3a1 >>>>>>>>> 080636c8 rb_eval (842937c, 8427158) + 28bc >>>>>>>>> 080686ca rb_call0 (84249e4, 842937c, 13c1, 13c1, 0, 0) + e66 >>>>>>>>> 08068dd2 rb_call (84249e4, 842937c, 13c1, 0, 0, 0) + 196 >>>>>>>>> 08062c99 rb_eval (84297f0, 82c1c28) + 1e8d >>>>>>>>> 080686ca rb_call0 (82b9ba4, 84297f0, 13c1, 13c1, 0, 0) + e66 >>>>>>>>> 08068dd2 rb_call (82b9ba4, 84297f0, 13c1, 0, 0, 0) + 196 >>>>>>>>> 08062c99 rb_eval (8433098, 842fde4) + 1e8d >>>>>>>>> 080686ca rb_call0 (8429840, 8433098, 13c1, 13c1, 0, 8041464) + e66 >>>>>>>>> 08068dd2 rb_call (8429840, 8433098, 13c1, 0, 8041464, 1) + 196 >>>>>>>>> 08068fbc rb_f_send (1, 8041460, 8433098) + f4 >>>>>>>>> 0806820d rb_call0 (80c8bb0, 8433098, fd1, fd1, 1, 8041460) + 9a9 >>>>>>>>> 08068dd2 rb_call (80c8bb0, 8433098, fd1, 1, 8041460, 0) + 196 >>>>>>>>> 08062c99 rb_eval (82ce4b4, 8360abc) + 1e8d >>>>>>>>> 0806344a rb_eval (82ce4b4, 83612dc) + 263e >>>>>>>>> 080686ca rb_call0 (82cda78, 82ce4b4, 6351, 6351, 0, 0) + e66 >>>>>>>>> 08068dd2 rb_call (82cda78, 82ce4b4, 6351, 0, 0, 2) + 196 >>>>>>>>> 080627f9 rb_eval (82ce4b4, 8361534) + 19ed >>>>>>>>> 080686ca rb_call0 (82cda78, 82ce4b4, 4ecf, 4ecf, 0, 0) + e66 >>>>>>>>> 08068dd2 rb_call (82cda78, 82ce4b4, 4ecf, 0, 0, 0) + 196 >>>>>>>>> 08062c99 rb_eval (80c78b4, 83ec9a4) + 1e8d >>>>>>>>> 0805dfeb eval_node (80c78b4, 83ec9a4) + 3f >>>>>>>>> 0806a819 rb_load (83ecbd4, 0) + 391 >>>>>>>>> 0806aaf4 rb_f_load (1, 80469e0, 80c78b4) + 48 >>>>>>>>> 0806820d rb_call0 (80c8bb0, 80c78b4, 25c1, 25c1, 1, 80469e0) + 9a9 >>>>>>>>> 08068dd2 rb_call (80c8bb0, 80c78b4, 25c1, 1, 80469e0, 1) + 196 >>>>>>>>> 08062a39 rb_eval (80c78b4, 80b761c) + 1c2d >>>>>>>>> 0805dfeb eval_node (80c78b4, 80b761c) + 3f >>>>>>>>> 0805e6d6 ruby_exec_internal (808a254, 80c8c50, 0, 0, 0, a00000) + d6 >>>>>>>>> 0805e74b ruby_exec (808a254, 8058cbe, 8047d4c, 8058cc6, feffb7e4, >>>>>>>>> feffde70) >>>>>>>>> + 27 >>>>>>>>> 0805e773 ruby_run (feffb7e4, feffde70, 0, 8047d4c, 8076d8d, >>>>>>>>> feffb7e4) >>>>>>>>> + >>>>>>>>> 23 >>>>>>>>> 08058cc6 main (5, 8047d74, 8047d8c) + 3a >>>>>>>>> 08058bfe _start (5, 8047e30, 8047e47, 8047e67, 8047e6a, 8047e75) + >>>>>>>>> 7a >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Eventmachine-talk mailing list >>>>>>>>> Eventmachine-talk at rubyforge.org >>>>>>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Eventmachine-talk mailing list >>>>>>>> Eventmachine-talk at rubyforge.org >>>>>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> _______________________________________________ >>>>>>> Eventmachine-talk mailing list >>>>>>> Eventmachine-talk at rubyforge.org >>>>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>> _______________________________________________ >>>>> Eventmachine-talk mailing list >>>>> Eventmachine-talk at rubyforge.org >>>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>>> >>>>> >>>>> >>>> _______________________________________________ >>>> Eventmachine-talk mailing list >>>> Eventmachine-talk at rubyforge.org >>>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>>> >>>> >>>> >>> _______________________________________________ >>> Eventmachine-talk mailing list >>> Eventmachine-talk at rubyforge.org >>> http://rubyforge.org/mailman/listinfo/eventmachine-talk >>> >>> >> _______________________________________________ >> Eventmachine-talk mailing list >> Eventmachine-talk at rubyforge.org >> http://rubyforge.org/mailman/listinfo/eventmachine-talk >> >> > _______________________________________________ > Eventmachine-talk mailing list > Eventmachine-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/eventmachine-talk >