Rails instances themselves are almost always single-threaded, whereas Mongrel, and its acceptor, are multithreaded.

In a situation with long-running Rails pages this presents a problem for mod_proxy_balancer.

If num_processors is greater than 1 (default: 950), then Mongrel will gladly accept incoming requests and queue them if its Rails instance is currently busy. So even though there are non-busy Mongrel instances, a busy one can accept a new request and queue it behind a long-running request.

I tried setting num_processors to 1, but that looks less than ideal -- I need to dig into mod_proxy_balancer to be sure, but at first glance it appears to replace the queuing problem with a proxy error. That's because Mongrel still accepts the incoming request -- only to close the new socket immediately if Rails is busy.

Once again, I do need to set up a test and see exactly how mod_proxy_balancer handles this... but...

If I understand the problem correctly, then one solution might be moving lines 721 through 734 into a loop, possibly in its own method, which does something like this:

    def myaccept
      while true
        # Check first to see if we can handle the request; let the
        # client worry about connect timeouts.
        return @socket.accept if @workers.list.length < @num_processors
        # reap_dead_workers returns the live worker count after reaping.
        sleep @loop_throttle while reap_dead_workers >= @num_processors
      end
    end

    720         @acceptor = Thread.new do
    721           while true
    722             begin
    723               client = @socket.accept
    724
    725               if $tcp_cork_opts
    726                 client.setsockopt(*$tcp_cork_opts) rescue nil
    727               end
    728
    729               worker_list = @workers.list
    730
    731               if worker_list.length >= @num_processors
    732                 STDERR.puts "Server overloaded with #{worker_list.length} processors (#@num_processors max). Dropping connection."
    733                 client.close rescue Object
    734                 reap_dead_workers("max processors")
    735               else
    736                 thread = Thread.new(client) {|c| process_client(c) }
    737                 thread[:started_on] = Time.now
    738                 @workers.add(thread)
    739
    740                 sleep @timeout/100 if @timeout > 0
    741               end
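As a minimal sketch of the reap-before-accept idea, the hypothetical class below (GatedAcceptor, dispatch, and live_workers are illustrative names, not Mongrel's real API) blocks *before* accept while all worker slots are busy, so an over-capacity instance leaves connections in the kernel listen backlog for the balancer to time out on, instead of accepting and then queuing or dropping them:

```ruby
# Sketch, not Mongrel code: gate dispatch on free worker slots.
class GatedAcceptor
  def initialize(num_processors)
    @num_processors = num_processors
    @workers = ThreadGroup.new   # live worker threads, as in Mongrel
    @loop_throttle = 0.01
  end

  # ThreadGroup#list only returns live threads, so finished workers
  # fall out of the count on their own.
  def live_workers
    @workers.list.length
  end

  # Stand-in for the acceptor loop: wait for a free slot, then start a
  # worker (the real loop would call @socket.accept at this point).
  def dispatch(client, &handler)
    sleep @loop_throttle while live_workers >= @num_processors
    thread = Thread.new(client, &handler)
    thread[:started_on] = Time.now
    @workers.add(thread)
    thread
  end

  def join_all
    @workers.list.each(&:join)
  end
end
```

With num_processors set to 1, a second dispatch does not run until the first worker thread has died, which is exactly the back-pressure the proposal is after.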
Mod_proxy_balancer is just a weighted round-robin, and doesn't consider actual worker load, so I don't think this will help you. Have you looked at Evented Mongrel?

Evan

On 10/15/07, Robert Mela <rob at robmela.com> wrote:
> [...]

--
Evan Weaver
Cloudburst, LLC
But it is precisely because of mod_proxy_balancer's round-robin algorithm that I think the fix *would* work. If we give mod_proxy_balancer the option of timing out on connect, it will iterate to the next Mongrel instance in the pool.

Of course, I should look at Evented Mongrel, and Swiftiply.

But still, my original question remains. I think that Mongrel would play much more nicely with mod_proxy_balancer out-of-the-box if it refused to call accept() until worker_list.length has been reduced. I personally prefer that to request queuing, and certainly to "accept then drop without warning".

The wildcard, of course, is what mod_proxy_balancer does in the drop-without-warning case -- if it gracefully moves on to the next Mongrel server in its balancer pool, then all is well, and I'm making a fuss about nothing.

Here's an armchair scenario to better illustrate why I think a fix would work. Again, I need to test to ensure that mod_proxy_balancer doesn't currently handle the situation gracefully.

Consider:

- A pool of 10 Mongrels behind mod_proxy_balancer.
- One Mongrel, say #5, gets a request that takes one minute to run (e.g., a complex report).
- The system as a whole gets 10 processing requests per second.

What happens (I think) with the current code and mod_proxy_balancer:

- Mongrel instance #5 will continue receiving a new request every second.
- Over the one-minute period, 10% of requests will either be
  - queued and unnecessarily delayed (num_processors > 60), or
  - picked up and dropped without warning (num_processors == 1).

What should happen if Mongrel does not invoke accept() when all workers are busy:

- Mongrel instance #5 will continue getting new *connection requests* every second.
- mod_proxy_balancer's connect() will time out.
- mod_proxy_balancer will continue cycling through the pool until it finds an available Mongrel instance.

Again, if all is well under the current scenario -- Apache mod_proxy_balancer gracefully moves on to another Mongrel instance after the accept/drop -- then I've just made a big fuss over a really dumb question...

Evan Weaver wrote:
> [...]
Oh, I misunderstood your code.

I don't think mod_proxy_balancer gracefully moves on, so perhaps you are right. On the other hand, I thought when a worker timed out it got removed from the pool permanently. I can't seem to verify that one way or the other in the Apache docs, though.

Evan

On 10/15/07, Robert Mela <rob at robmela.com> wrote:
> [...]

--
Evan Weaver
Cloudburst, LLC
Ah, no, they are only marked operational until the retry timeout has elapsed. So I guess if you had extremely small timeouts in both Apache and Mongrel it would work OK.

Someone else respond, because clearly I don't know what I'm talking about. :)

Evan

On 10/15/07, Evan Weaver <evan at cloudbur.st> wrote:
> [...]

--
Evan Weaver
Cloudburst, LLC
Robert Mela
2007-Oct-15 20:52 UTC
[Mongrel] Workaround found for request queuing vs. num_processors, accept/close
I've discovered a setting in mod_proxy_balancer that prevents the Mongrel/Rails request-queuing vs. accept/close problem from ever being reached.

For each BalancerMember:

- max=1 -- caps the maximum number of connections Apache will open to a BalancerMember at 1
- acquire=N -- the maximum amount of time (N seconds) to wait to acquire a connection to a balancer member

So, at a minimum:

    BalancerMember http://foo max=1 acquire=1

and I'm using:

    BalancerMember http://127.0.0.1:9000 max=1 keepalive=on acquire=1 timeout=1

====

I experimented with three Mongrel servers, and tied one up for 60 seconds at a time by calling "sleep" in a handler.

Without the "acquire" parameter, mod_proxy_balancer's simple round-robin scheme blocked waiting when it reached a busy BalancerMember, effectively queuing the request. With "acquire" set, the balancer stepped over the busy BalancerMember and continued searching through its round-robin cycle.

So, whether or not Mongrel's accept/close and request queuing are issues, there is a setting in mod_proxy_balancer that prevents either problem from being triggered. At a bare minimum, for a single-threaded process running in Mongrel:

    BalancerMember http://127.0.0.1:9000 max=1 acquire=1
    BalancerMember http://127.0.0.1:9001 max=1 acquire=1
    ...

With all BalancerMembers busy, Apache returns a 503 Server Busy, which is a heck of a lot more appropriate than a 502 proxy error.

====

It turns out that having Mongrel reap threads before calling accept prevents both queuing in Mongrel and Mongrel's accept/close behavior.

But BalancerMembers in mod_proxy_balancer will still need "acquire" to be set -- otherwise proxy client threads will sit around waiting for Mongrel to call accept, effectively queuing requests in Apache.

Since max=1 acquire=1 steps around the queuing problem altogether, the reap-before-accept fix, though more correct, is of no practical benefit.
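The observed behavior can be modeled in a few lines of plain Ruby (this is a toy model, not Apache's implementation; `busy` stands for "the single max=1 connection slot could not be acquired within the acquire timeout"): the round-robin scan steps over a busy member, and only when every member is busy does the request fail, which is where Apache answers 503.

```ruby
# Toy model of round-robin member selection with max=1 acquire=1.
Member = Struct.new(:url, :busy)

# Scan the pool once from start_index, skipping busy members.
# Returns the chosen member, or nil (Apache would answer 503 here).
def pick_member(pool, start_index)
  pool.length.times do |i|
    member = pool[(start_index + i) % pool.length]
    return member unless member.busy
  end
  nil
end
```

Without `acquire`, the equivalent model would block on the busy member instead of skipping it -- which is exactly the request-queuing behavior seen in the experiment.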
===

With the current Mongrel code, BalancerMember max > 1 and Mongrel num_processors > 1 triggers the accept/close bug.

Likewise, BalancerMember max > 1 with Mongrel num_processors > 1 runs into Mongrel's request queuing....

===

Conclusion:

I'd like to see Mongrel return a 503 Server Busy when an incoming request hits the num_processors limit.

For practical use, the fix to the problems is in configuring mod_proxy_balancer such that it shields against encountering either issue.
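The 503 suggestion could look something like the sketch below at the point where the current acceptor does `client.close`. This is not actual Mongrel code -- `reject_busy` and the response text are made up for illustration:

```ruby
require 'socket'

# Sketch: write a minimal 503 before closing an over-limit connection,
# instead of closing it silently. reject_busy is a hypothetical helper.
BUSY_RESPONSE = "HTTP/1.1 503 Service Unavailable\r\n" \
                "Content-Type: text/plain\r\n"         \
                "Connection: close\r\n"                \
                "Content-Length: 12\r\n"               \
                "\r\n"                                 \
                "Server busy\n".freeze

def reject_busy(client)
  client.write(BUSY_RESPONSE)
rescue StandardError
  nil # peer may already be gone; the close below is all that matters
ensure
  client.close rescue nil
end
```

The proxy (or a browser hitting Mongrel directly) then sees an explicit "busy" status rather than an abruptly reset connection it has to interpret as a 502.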
Robert Mela
2007-Oct-15 21:18 UTC
[Mongrel] Workaround found for request queuing vs. num_processors, accept/close
Typo -- the following is incorrect:

    With the current Mongrel code, BalancerMember max > 1 and Mongrel
    num_processors > 1 triggers the accept/close bug.

should be:

    With the current Mongrel code, BalancerMember max > 1 and Mongrel
    num_processors = 1 triggers the accept/close bug.

===

Robert Mela wrote:
> [...]
Evan Weaver
2007-Oct-15 22:04 UTC
[Mongrel] Workaround found for request queuing vs. num_processors, accept/close
Very cool. Can you do a little performance testing to see if it's more efficient under various loads than the current way?

I would expect it to make a small but significant difference when you're near the CPU saturation point, but not much if you're below it (enough free resources already) or above it (requests will get piled up regardless). It may be worse in the overloaded situation because no one's request will get through -- the queue might grow indefinitely instead of getting truncated.

The 503 behavior seems reasonable.

Evan

On 10/15/07, Robert Mela <rob at robmela.com> wrote:
> [...]

--
Evan Weaver
Cloudburst, LLC
Brian Williams
2007-Oct-15 23:43 UTC
[Mongrel] Design flaw? - num_processors, accept/close
We recently ran into exactly this issue. Some rails requests were making external requests that were taking 5 minutes (networking issues out of our control). If your request got queued behind one of these stuck mongrels, the experience was terrible. I experimented with adjusting the mod_proxy_balance settings to try to get it to fail over to the next mongrel (I had hoped that the min,max,smax could all be set to one, and force only one connection to a mongrel at a time) but this didn''t seem to work. Solution - I stuck lighttpd in between. Lighttp has a proxying algorithm that does exactly this - round robin but to worker with lightest load. I''d love to hear that there''s a way to use mod_proxy_balancer, but I couldn''t get to work. --Brian On 10/15/07, Evan Weaver <evan at cloudbur.st> wrote:> > Ah, no, they are only marked operational until the retry timeout is > elapsed. So I guess if you had extremely small timeouts in Apache and > Mongrel both it would work ok. > > Someone else respond, because clearly I don''t know what I''m talking about. > :) > > Evan > > On 10/15/07, Evan Weaver <evan at cloudbur.st> wrote: > > Oh, I misunderstood your code. > > > > I don''t think mod_proxy_balancer gracefully moves on so perhaps you > > are right. On the other hand, I thought when a worker timed out it got > > removed from the pool permanently. I can''t seem to verify that one way > > or the other in the Apache docs, though. > > > > Evan > > > > On 10/15/07, Robert Mela <rob at robmela.com> wrote: > > > But it is precisely because of mod_proxy_balancer''s round-robin > > > algorithm that I think the fix *would* work. If we give > > > mod_proxy_balancer the option of timing out on connect, it will > iterate > > > to the next mongrel instance in the pool. > > > > > > Of course, I should look at Evented Mongrel, and swiftiply. > > > > > > But still, my original question remains. 
> > > I think that Mongrel would play much more nicely with
> > > mod_proxy_balancer out-of-the-box if it refused to call accept()
> > > until worker_list.length has been reduced. I personally prefer
> > > that to request queuing, and certainly to "accept then drop
> > > without warning".
> > >
> > > The wildcard, of course, is what mod_proxy_balancer does in the
> > > drop-without-warning case -- if it gracefully moves on to the next
> > > Mongrel server in its balancer pool, then all is well, and I'm
> > > making a fuss about nothing.
> > >
> > > Here's an armchair scenario to better illustrate why I think a fix
> > > would work. Again, I need to test to ensure that
> > > mod_proxy_balancer doesn't currently handle the situation
> > > gracefully --
> > >
> > > Consider:
> > >
> > > - A pool of 10 mongrels behind mod_proxy_balancer.
> > > - One mongrel, say #5, gets a request that takes one minute to run
> > >   (e.g., a complex report)
> > > - The system as a whole gets 10 processing requests per second
> > >
> > > What happens (I think) with the current code and
> > > mod_proxy_balancer:
> > >
> > > - Mongrel instance #5 will continue receiving a new request every
> > >   second.
> > > - Over the one minute period, 10% of requests will either be
> > >   - queued and unnecessarily delayed (num_processors > 60), or
> > >   - picked up and dropped without warning (num_processors == 1)
> > >
> > > What should happen if Mongrel does not invoke accept when all
> > > workers are busy:
> > >
> > > - Mongrel instance #5 will continue getting new *connection
> > >   requests* every second
> > > - mod_proxy_balancer's connect() will time out
> > > - mod_proxy_balancer will continue cycling through the pool till
> > >   it finds an available Mongrel instance
> > >
> > > Again, if all is well under the current scenario -- if Apache's
> > > mod_proxy_balancer gracefully moves on to another Mongrel instance
> > > after the accept/drop -- then I've just made a big fuss over a
> > > really dumb question...
> > >
> > > Evan Weaver wrote:
> > > > mod_proxy_balancer is just a weighted round-robin, and doesn't
> > > > consider actual worker load, so I don't think this will help
> > > > you. Have you looked at Evented Mongrel?
> > > >
> > > > Evan
> > > >
> > > > On 10/15/07, Robert Mela <rob at robmela.com> wrote:
> > > >> Rails instances themselves are almost always single-threaded,
> > > >> whereas Mongrel, and its acceptor, are multithreaded.
> > > >>
> > > >> In a situation with long-running Rails pages this presents a
> > > >> problem for mod_proxy_balancer.
> > > >>
> > > >> If num_processors is greater than 1 (default: 950), then
> > > >> Mongrel will gladly accept incoming requests and queue them if
> > > >> its Rails instance is currently busy. So even though there are
> > > >> non-busy mongrel instances, a busy one can accept a new request
> > > >> and queue it behind a long-running request.
> > > >>
> > > >> I tried setting num_processors to 1. But it looks like this is
> > > >> less than ideal -- I need to dig into mod_proxy_balancer to be
> > > >> sure.
> > > >> But at first glance, it appears this replaces the queuing
> > > >> problem with a proxy error. That's because Mongrel still
> > > >> accepts the incoming request -- only to close the new socket
> > > >> immediately if Rails is busy.
> > > >>
> > > >> Once again, I do need to set up a test and see exactly how
> > > >> mod_proxy_balancer handles this... but...
> > > >>
> > > >> If I understand the problem correctly, then one solution might
> > > >> be moving lines 721 thru 734 into a loop, possibly in its own
> > > >> method, which does something like this:
> > > >>
> > > >>   def myaccept
> > > >>     while true
> > > >>       # Check first to see if we can handle the request. Let
> > > >>       # the client worry about connect timeouts.
> > > >>       return @socket.accept if worker_list.length < num_processors
> > > >>       while @num_processors < reap_dead_workers
> > > >>         sleep @loop_throttle
> > > >>       end
> > > >>     end
> > > >>   end
> > > >>
> > > >>   720       @acceptor = Thread.new do
> > > >>   721         while true
> > > >>   722           begin
> > > >> * 723             client = @socket.accept
> > > >> * 724
> > > >>   725             if $tcp_cork_opts
> > > >>   726               client.setsockopt(*$tcp_cork_opts) rescue nil
> > > >>   727             end
> > > >>   728
> > > >>   729             worker_list = @workers.list
> > > >>   730
> > > >>   731             if worker_list.length >= @num_processors
> > > >>   732               STDERR.puts "Server overloaded with #{worker_list.length} processors (#@num_processors max). Dropping connection."
> > > >> * 733             client.close rescue Object
> > > >>   734             reap_dead_workers("max processors")
> > > >>   735           else
> > > >>   736             thread = Thread.new(client) {|c| process_client(c) }
> > > >>   737             thread[:started_on] = Time.now
> > > >>   738             @workers.add(thread)
> > > >>   739
> > > >>   740             sleep @timeout/100 if @timeout > 0
> > > >>   741           end
>
> -- 
> Evan Weaver
> Cloudburst, LLC
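[Editor's note: the "don't call accept until a worker slot is free" idea in the quoted code can be modeled in plain Ruby. This is a toy sketch, not Mongrel's actual source; `TinyServer` and all its names are invented for illustration.]

```ruby
require 'socket'

# Toy model of reap-before-accept: while this server is at capacity it
# simply does not call accept, so a proxy's connect attempt can time out
# and move on to another backend instead of being accepted and dropped.
class TinyServer
  attr_reader :port

  def initialize(num_processors)
    @socket = TCPServer.new('127.0.0.1', 0)  # ephemeral port
    @port = @socket.addr[1]
    @num_processors = num_processors
    @workers = ThreadGroup.new               # dead threads drop out of #list
    @loop_throttle = 0.01
  end

  # Block *before* accept while all worker slots are busy.
  def myaccept
    sleep @loop_throttle while @workers.list.length >= @num_processors
    @socket.accept
  end

  def run
    @acceptor = Thread.new do
      loop do
        client = myaccept
        thread = Thread.new(client) do |c|
          c.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
          c.close
        end
        @workers.add(thread)
      end
    end
  end
end

server = TinyServer.new(2)  # allow two in-flight requests
server.run
```

A real fix would also need the reap_dead_workers call from the quoted listing; here ThreadGroup#list shrinking as worker threads exit stands in for it.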
On Mon, 15 Oct 2007 12:51:47 -0400, "Evan Weaver" <evan at cloudbur.st> wrote:
> Ah, no, they are only marked operational until the retry timeout has
> elapsed. So I guess if you had extremely small timeouts in both Apache
> and Mongrel it would work ok.
>
> Someone else respond, because clearly I don't know what I'm talking
> about. :)

I'm confused; isn't the point of a balancer that it tries all available
backends multiple times before giving up? If m_p_b is aborting on the
first accept that's denied, then it's broken. It should try every one,
and possibly twice or three times, before giving up. Otherwise it's not
really a "balancer", but more of a "robinder".

Also, the proposed solution probably won't work. If my crufty late-night
brain is right, this would mean that the backend will attempt a connect
to a "sleeping" mongrel and either have to wait until the TCP timeout or
just get blocked. Eventually you're back at the same problem: you have
tons of requests piling up, they're just piled up in the OS TCP stack
where no useful work can be done. At least piling them up in Mongrel
means some IO is getting processed.

And, it sounds like nobody is actually trying these proposed solutions.
Anyone got some metrics? Tried Lucas Carlson's Dr. Proxy yet? Other
solutions? Evented Mongrel?

-- 
Zed A. Shaw
- Hate: http://savingtheinternetwithhate.com/
- Good: http://www.zedshaw.com/
- Evil: http://yearofevil.com/
On Mon, 15 Oct 2007 16:43:34 -0700, "Brian Williams" <bwillenator at gmail.com> wrote:
> We recently ran into exactly this issue. Some rails requests were
> making external requests that were taking 5 minutes (networking issues
> out of our control).

Now that's a design flaw. If you're expecting the UI user to wait for a
backend request that takes 5 minutes, then you need to redesign the
workflow and interface. Do it like asynchronous email, where the user
"sends a request", "awaits a reply", "reads the reply", and doesn't deal
with the backend processing chain of events.

If done right, you'll even get a performance boost, and you can
distribute the load of these requests out to other servers. It's also a
model most users are familiar with from SMTP processing.

-- 
Zed A. Shaw
- Hate: http://savingtheinternetwithhate.com/
- Good: http://www.zedshaw.com/
- Evil: http://yearofevil.com/
> At least piling them up in Mongrel means some IO is getting processed.

Ok, that's the real issue then. When you have a heavy queuing situation,
Ruby can at least schedule the IO among the green threads, whereas
Apache has to keep them serialized waiting for a worker to open up.

Evan

On 10/16/07, Zed A. Shaw <zedshaw at zedshaw.com> wrote:
> On Mon, 15 Oct 2007 16:43:34 -0700, "Brian Williams" <bwillenator at gmail.com> wrote:
> > We recently ran into exactly this issue. Some rails requests were
> > making external requests that were taking 5 minutes (networking
> > issues out of our control).
>
> Now that's a design flaw. If you're expecting the UI user to wait for
> a backend request that takes 5 minutes, then you need to redesign the
> workflow and interface. Do it like asynchronous email, where the user
> "sends a request", "awaits a reply", "reads the reply", and doesn't
> deal with the backend processing chain of events.
>
> <snip>

-- 
Evan Weaver
Cloudburst, LLC
Alexey Verkhovsky
2007-Oct-16 06:02 UTC
[Mongrel] Design flaw? - num_processors, accept/close
On 10/15/07, Zed A. Shaw <zedshaw at zedshaw.com> wrote:
> Tried Lucas Carlson's Dr. Proxy yet? Other solutions? Evented Mongrel?

HAProxy (and some other proxies smarter than mod_proxy_balancer) solves
this problem by letting you set the maximum number of requests
outstanding to any node in the cluster. Setting it to 1 means that it
will only ask a Mongrel instance to serve a request when it's not
already doing so. Which makes perfect sense with Rails
(single-threaded), especially if you have something else serving static
content in this setup.

Setting num_processors to 1 is only possible when you have a proxy that
can restrict itself from sending more than one request per Mongrel.
Otherwise, if I remember correctly, you replace occasional delays with
HTTP 503s. Not a good trade-off.

Setting num_processors low has a positive side effect of restricting how
far your Mongrel will grow in memory when put under strain, even for a
short period. It grows in memory by allocating RAM to new threads (which
then pile up on a Rails mutex). With, say, 10 Mongrels and a default
num_processors = 1024, allocating memory for 1024 * 10 threads means
hundreds of megabytes of RAM.

I usually set num_processors to something a bit bigger than 1 (say, 5),
just so that monitoring can hit it at the same time the load balancer
does.

-- 
Alexey Verkhovsky
CruiseControl.rb [http://cruisecontrolrb.thoughtworks.com]
RubyWorks [http://rubyworks.thoughtworks.com]
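[Editor's note: the per-backend cap Alexey describes corresponds to HAProxy's `maxconn` option on a `server` line. A hypothetical fragment, with invented names and ports, using current HAProxy syntax rather than the 2007-era directives:]

```haproxy
# Hypothetical HAProxy fragment: at most one outstanding request per
# Mongrel; further requests wait in HAProxy's queue (or fail after
# "timeout queue"), never inside a busy Mongrel.
defaults
    mode http
    timeout connect 2s
    timeout queue   30s

listen mongrel_cluster
    bind 0.0.0.0:80
    balance roundrobin
    server app1 127.0.0.1:9000 maxconn 1
    server app2 127.0.0.1:9001 maxconn 1
```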
Paul Butcher
2007-Oct-16 11:35 UTC
[Mongrel] Workaround found for request queuing vs. num_processors, accept/close
On 15 Oct 2007, at 21:52, Robert Mela wrote:
> I've discovered a setting in mod_proxy_balancer that prevents the
> Mongrel/Rails request queuing vs. accept/close problem from ever being
> reached.

Thanks for that, Robert. We've hit exactly the same issue in the past,
but have been unable to find a way to persuade mod_proxy_balancer to do
the right thing.

I posted about this issue here a year or so ago:

http://rubyforge.org/pipermail/mongrel-users/2006-September/001653.html

But was unable to get anyone to take it seriously :-(

-- 
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: https://www.linkedin.com/in/paulbutcher
MSN: paul at paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher
On 16 Oct 2007, at 06:49, Zed A. Shaw wrote:
> On Mon, 15 Oct 2007 16:43:34 -0700, "Brian Williams" <bwillenator at gmail.com> wrote:
> > We recently ran into exactly this issue. Some rails requests were
> > making external requests that were taking 5 minutes (networking
> > issues out of our control).
>
> Now that's a design flaw. If you're expecting the UI user to wait for
> a backend request that takes 5 minutes, then you need to redesign the
> workflow and interface. Do it like asynchronous email, where the user
> "sends a request", "awaits a reply", "reads the reply", and doesn't
> deal with the backend processing chain of events.

Zed, you're being obtuse. Of course that isn't what Brian means. What
he's doing is giving a pathological example to illustrate just how badly
the mod_proxy_balancer/mongrel/rails combination behaves when things go
wrong.

Yes, you can mask the problem to some extent by mucking about with your
application (and in fact that's what we've done here), but that's
missing the point.

It is not unreasonable to expect that some actions performed by an
application are "fast" and some are "slow". It's further not
unreasonable to expect a very large difference between the fastest and
the slowest actions (if one action takes 10ms and another takes 1s,
that's not unreasonable - but it is a 2 order of magnitude difference).

With the obvious setup, fast actions will be delayed behind slow
actions. This is a Bad Thing.

Furthermore, people are fallible. If I happen to accidentally introduce
an action into my system which takes 10s, yes, I've screwed up and
should fix it. But is it reasonable for the fact that I have a single
(possibly very rare) action which takes 10s to mean that all the other
fast actions are affected? Even when most of my mongrels are idle?

Of course, this isn't really a problem with Mongrel.
It's a problem with Ruby (which doesn't know what the word "thread"
means) and Rails (which doesn't even manage to successfully make use of
the brain-dead version of threading which Ruby does support).

-- 
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: https://www.linkedin.com/in/paulbutcher
MSN: paul at paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher
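[Editor's note: Paul's head-of-line-blocking point can be demonstrated with a toy simulation in plain Ruby. The timings are invented; the point is that every fast request stuck behind one slow request on a single busy backend pays the slow request's full cost.]

```ruby
# One "mongrel" draining its private queue in order: a single slow
# action delays every fast action queued behind it.
queue = Queue.new
latencies = {}
start = Time.now

worker = Thread.new do
  until (job = queue.pop) == :stop
    name, cost = job
    sleep cost                       # simulate request processing time
    latencies[name] = Time.now - start
  end
end

queue << [:slow, 0.5]                # a one-off slow action (invented cost)
3.times { |i| queue << [:"fast#{i}", 0.01] }  # cheap actions behind it
queue << :stop
worker.join

# Every fast action's observed latency is dominated by the slow one.
latencies.each { |name, t| puts "#{name}: #{t.round(2)}s" }
```

Even though each fast job needs only 10 ms of work, all of them report latencies above half a second, which is exactly the behavior Paul objects to.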
What'r y'all usin' for load generation / perf metrics tools? This is a
huge area, and I wonder if you narrow down to certain things for smoke
tests or such.
What settings did you use in m_p_b? The trick to making it work was
"acquire", "max", and probably "timeout".

Brian Williams wrote:
> We recently ran into exactly this issue. Some rails requests were
> making external requests that were taking 5 minutes (networking issues
> out of our control). If your request got queued behind one of these
> stuck mongrels, the experience was terrible. I experimented with
> adjusting the mod_proxy_balancer settings to try to get it to fail
> over to the next mongrel (I had hoped that min, max, and smax could
> all be set to one, forcing only one connection to a mongrel at a time)
> but this didn't seem to work.
>
> Solution - I stuck lighttpd in between. Lighttpd has a proxying
> algorithm that does exactly this - round robin, but to the worker with
> the lightest load.
>
> I'd love to hear that there's a way to use mod_proxy_balancer, but I
> couldn't get it to work.
>
> --Brian
Alexey Verkhovsky wrote:
> On 10/15/07, Zed A. Shaw <zedshaw at zedshaw.com> wrote:
> > Tried Lucas Carlson's Dr. Proxy yet? Other solutions? Evented
> > Mongrel?
>
> HAProxy (and some other proxies smarter than mod_proxy_balancer)
> solves this problem by allowing you to set the maximum number of
> requests outstanding to any node in the cluster.

But m_p_b is correct in this!!! It's the "max" attribute to
BalancerMember. It's just a pain to discover the correct combination of
parameters!

> Setting it to 1 means that it will only ask a Mongrel instance to
> serve a request when it's not already doing so.

But m_p_b IS doing this correctly, as you specify! It's a matter of
combining the "max" and "acquire" attrs on BalancerMember. Perhaps the
thing that needs changing is making this the default m_p_b behavior, or
better documentation (or both!).

> Which makes perfect sense with Rails (single-threaded), especially if
> you do have something else to serve static content in this setup.
>
> Setting num_processors to 1 is only possible when you have a proxy
> that can restrict itself from sending more than one request per
> Mongrel.

Which we do in m_p_b, via the "max" attribute to BalancerMember.

> Otherwise, if I remember correctly, you replace occasional delays with
> HTTP 503s. Not a good trade-off.

The 503s would only be generated in the case of incorrect m_p_b
settings. A 503 "server busy" coming from the Mongrel back end gives
developers and admins a better idea of what's really happening.
Consider: the back end has reached maximum capacity. Saying "Hey, 503!
I'm at max capacity" is better than the current action -- open and close
with no indication of what's wrong.

> Setting num_processors low has a positive side effect of restricting
> how far your Mongrel will grow in memory when put under strain. [...]
> I usually set num_processors to something a bit bigger than 1 (say,
> 5), just so that monitoring can hit it at the same time the load
> balancer does.

Excellent idea!
On Tue, 16 Oct 2007 12:49:51 +0100, Paul Butcher <paul at paulbutcher.com> wrote:
> On 16 Oct 2007, at 13:45, Zed A. Shaw wrote:
> > Now that's a design flaw. If you're expecting the UI user to wait
> > for a backend request that takes 5 minutes, then you need to
> > redesign the workflow and interface. Do it like asynchronous email,
> > where the user "sends a request", "awaits a reply", "reads the
> > reply", and doesn't deal with the backend processing chain of
> > events.
>
> Zed, you're being obtuse. Of course that isn't what Brian means. What
> he's doing is giving a pathological example to illustrate just how
> badly the mod_proxy_balancer/mongrel/rails combination behaves when
> things go wrong.
>
> Yes, you can mask the problem to some extent by mucking about with
> your application (and in fact that's what we've done here), but that's
> missing the point.

No, as usual performance panic has set in, and you're not looking at the
problem in the best way to solve it.

EVERYTHING takes time. No amount of super-fast assembler-based
multiplexed evented code will get around that. In his example he also
was relying on an external service. It is a *classic* mistake to make
the user wait for a remote service and all of your backend processes to
finish before they see the end of the HTTP request.

What people constantly do, though, is assume that the boundary of their
transactions must also match the single boundary of one HTTP request. If
you break this so that presentation of the process is decoupled from the
actual process, then you don't have the problem of the user eating up a
web server.

But I'm sure nobody will ever convince programmers of this.
They love to run around "performance tuning" stuff instead of just
redesigning the system so it appears fast.

-- 
Zed A. Shaw
- Hate: http://savingtheinternetwithhate.com/
- Good: http://www.zedshaw.com/
- Evil: http://yearofevil.com/
On 16 Oct 2007, at 13:45, Zed A. Shaw wrote:
> No, as usual performance panic has set in and you're not looking at
> the problem in the best way to solve it.

Sorry Zed, I have a great deal of respect for your work and your
opinions on development. But you seem to have a blind spot here, and I
just don't understand why.

This has nothing to do with optimisation. It has nothing to do with
performance. It's got everything to do with resilience and reliability.

Clearly what you say about waiting for remote services is true. Doing so
is a Bad Thing, and an application shouldn't do it. But you're missing
the point.

Your philosophy guarantees that your application's performance will be
held hostage by the worst-performing action within it. What if I screw
up and accidentally roll out a "bad" action? Should this mean that
*every aspect* of my app now behaves terribly? Following your logic, it
does.

The whole point of a load balancer is that it should enable things to
behave sensibly even if one of my backend servers is screwed up. But a
mismatch between the expectations encoded within mod_proxy_balancer and
Mongrel running Ruby on Rails means that this isn't the case.

Similarly, if I write a quick and dirty reporting action which runs an
SQL query that takes 10 seconds to complete, should that screw up my
entire application? It seems unreasonable to me that I have to optimise
an action like this (why should I care if a reporting action which is
only used once a day takes 10 seconds to complete?). I *do* care if,
every time I run it, I cause all the 0.1-second actions to queue up
behind it.

-- 
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: https://www.linkedin.com/in/paulbutcher
MSN: paul at paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher
Brian Williams
2007-Oct-16 14:52 UTC
[Mongrel] Design flaw? - num_processors, accept/close
On 10/15/07, Zed A. Shaw <zedshaw at zedshaw.com> wrote:
> Now that's a design flaw. If you're expecting the UI user to wait for
> a backend request that takes 5 minutes, then you need to redesign the
> workflow and interface. Do it like asynchronous email, where the user
> "sends a request", "awaits a reply", "reads the reply", and doesn't
> deal with the backend processing chain of events.
>
> If done right, you'll even get a performance boost, and you can
> distribute the load of these requests out to other servers. It's also
> a model most users are familiar with from SMTP processing.

Just to clarify, we were accessing a web service that typically returns
results in < 1 second. But due to network issues out of our control,
these requests were going into a black hole and waiting for TCP
timeouts. Admittedly, since this was to an external service, we could
shift to a model where all updates are asynchronous, but this doesn't
help in the cases that Paul mentions, such as slower reporting queries
or programmer-error slow actions, which then end up degrading the
experience for all users of the site.

Assuming we did switch to an asynchronous model, I would think it would
be more like: show me the latest FOO, trigger a backend update to get
the latest FOO, return the last cached FOO. Or, if you know what FOO is,
you periodically update it and don't bother triggering an update. The
first request would then return something like "Fetching results",
right?

--Brian
Brian Williams wrote:
> We recently ran into exactly this issue. Some rails requests were
> making external requests that were taking 5 minutes (networking issues
> out of our control). If your request got queued behind one of these
> stuck mongrels, the experience was terrible. I experimented with
> adjusting the mod_proxy_balancer settings to try to get it to fail
> over to the next mongrel (I had hoped that min, max, and smax could
> all be set to one, forcing only one connection to a mongrel at a time)
> but this didn't seem to work.
>
> Solution - I stuck lighttpd in between. Lighttpd has a proxying
> algorithm that does exactly this - round robin, but to the worker with
> the lightest load.

Were you on Apache 2.0 or 2.2? mod_proxy_balancer is 2.2 only. It has
the same features as lighty's balancer, and many important ones that
lighty doesn't.

We had 2.0 <-> lighttpd <-> mongrel_cluster. I like 2.2 with
mod_proxy_balancer better. Lighty missed some features we needed, and I
wasn't prepared to implement them.

I made heavy use of the following logging features in Apache and m_p_b
for diagnostics:

- request duration in microseconds (lighty only offers seconds... ugh)
- client session cookie
- balancer member (which load balancer member Apache sent the request
  to)
- client socket status at the end of the request

I should correct the round-robin misperception. More accurately,
mod_proxy_balancer does request balancing. The module sends an equal
number of requests to each back end, at least according to the docs. It
has another mode where it balances by bytes transferred.

The icing on the cake for me was mod_proxy_balancer's status page. It
gives a live view of configured balancer pools and stats for each pool
member.

> I'd love to hear that there's a way to use mod_proxy_balancer, but I
> couldn't get it to work.
>
> --Brian
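[Editor's note: the four diagnostics Robert lists map onto Apache 2.2 logging directives. A hypothetical LogFormat along these lines; the session cookie name is invented and app-specific, and BALANCER_WORKER_NAME is an environment variable exported by mod_proxy_balancer.]

```apache
# Hypothetical access-log format covering the four items above:
#   %D                        request duration in microseconds
#   %{_session_id}C           client session cookie (name is app-specific)
#   %{BALANCER_WORKER_NAME}e  which balancer member served the request
#   %X                        connection status when the response finished
#                             (X = aborted, + = keep-alive, - = closed)
LogFormat "%h %l %u %t \"%r\" %>s %b %D %{_session_id}C %{BALANCER_WORKER_NAME}e %X" balancer_diag
CustomLog logs/balancer_log balancer_diag
```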
On Tue, 16 Oct 2007 07:52:19 -0700, "Brian Williams" <bwillenator at gmail.com> wrote:
> Just to clarify, we were accessing a web service that typically
> returns results in < 1 second. But due to network issues out of our
> control, these requests were going into a black hole and waiting for
> TCP timeouts. Admittedly, since this was to an external service, we
> could shift to a model where all updates are asynchronous, but this
> doesn't help in the cases that Paul mentions, such as slower reporting
> queries or programmer-error slow actions, which then end up degrading
> the experience for all users of the site.

There's also an odd thing about performance: users perceive the range of
response times as "slow", not the mean. I have no idea why, but I've
seen it over and over again. You'll take a system that has a mean perf
of 2 seconds but a range of .5 to 10 seconds, and they think it's
"slow". Tune the system so that it has 3-second mean perf but only a
range of 2 to 4 seconds, and they think it's "fast as hell".

But yes, if the service isn't under your control then you'll get hit by
this over and over. It's better to set up an "async firewall" both in
the service layer and in your UI so that they don't deal with things
that are potentially variable.

> Assuming we did switch to an asynchronous model, I would think it
> would be more like: show me the latest FOO, trigger a backend update
> to get the latest FOO, return the last cached FOO. Or, if you know
> what FOO is, you periodically update it and don't bother triggering an
> update.

There's a few general approaches you can try, depending on the type of
application you've got and what you can do with the UI. I like to
generally categorize them into the "polling" or "inbox" methods.

In polling, your controllers have four general actions to deal with the
request: submit, poll, cancel, get. In this one, the user submits their
request like normal, and you display a "Waiting for this to happen
dude..." message.
Your submit action builds the request and hands it to some service that
does the real work (like backgroundrb), then returns the waiting message
immediately. The waiting page then simply has a bit of ajaxy good
javascript that hits the poll method to see if it's done yet, and
updates a spinner or something. If you want a cancel link on the waiting
page, then cancel would abort, tell backgroundrb to stop, and shunt the
user off to the end. Finally, when the poll method says it's done, you
redirect to the "get" action to retrieve the final result.

There's many variations on this depending on the type of tech you have,
and it typically works best for situations where the user will
eventually see something and shouldn't go off doing other things -- such
as in a strict biz process where they MUST complete this task before
moving on (like looking up a flight on a reservation system).

In the "inbox" method (or email method) you just adopt the tried and
true method of having an inbox and outbox. Users get a way to submit
requests. That goes in the outbox. They can then see all the pending
stuff. You then have your background processor just pull things out of
people's outboxes, process them, and put the results in the inbox.
Simple, and the UI for this means you have lots of chances to give them
something else to do.

The nice thing about this approach is the user doesn't have to care
who's dealing with it, and they can even set up scheduled tasks that
just get run, with results put in their inbox (which would mean no need
for an outbox, but maybe a tasks folder). Canceling is simply a matter
of removing it from their inbox. Really good uses for this are of course
things like email and printing, but also having reports generated,
conducting big number crunches, asking for analysis, etc.

The trick is then to come up with a UI model that lets you use the inbox
method whenever you can. Let's take the flight system as an example.
Currently they have a polling method on most sites, but you could do an
inbox method if the user interface was more conversational and based on
secondary information about the user (like, they have Delta miles). In
this model, the user puts in more information up front, or it's
inferred, then says "tell me what you can find for me." The system uses
all its power to go out and look for a flight, potentially taking days,
and simply puts status or results in the person's inbox for review or
acceptance. The user would have to understand this UI approach and see
an advantage to it, so if the results weren't better than just a quick
query via polling, it's pointless.

A nice advantage of this is the user can train whatever engine you use,
the same way they train a Bayes classifier. Imagine if the above
reservation system put potential flights in your inbox, and you went in
and just smacked a "hell no / maybe so / more like this" button. This
trains the flight-reservation-finding engine to give you better and
better results until it finds what you want. Keep this information
around and eventually the user will think your flight system is
absolutely perfect. The test that you've got an "inbox" method right for
a flight reservation system is if people can reserve flights they want
via txt message off their phones over the course of a day.

Another place for this would be a movie site like Netflix. Instead of
naming genres and exact movies, you go in and give demographic
information as well as some movies you like. After this initial
training, you put out requests for things like "Give me movies I might
like that are sad." Netflix makes a "folder" for this called "movies
that are sad" and starts to fill it with what it thinks you might like.
It actually doesn't know, but as the user classifies what is sad or not,
Netflix begins to learn more and gives better sad-movie results.
Eventually users are just getting movies that they've pre-classified and
don't even bother searching.
And as usual, I'll add my usual disclaimer: this isn't a boolean decision, and these aren't the only two solutions. In fact, combining the above for a flight reservation system would be a powerful metaphor if you could figure it out without confusing people.

Hope that helps.

--
Zed A. Shaw
- Hate: http://savingtheinternetwithhate.com/
- Good: http://www.zedshaw.com/
- Evil: http://yearofevil.com/
On Tue, 16 Oct 2007 14:01:16 +0100 Paul Butcher <paul at paulbutcher.com> wrote:

> On 16 Oct 2007, at 13:45, Zed A. Shaw wrote:
> > No, as usual performance panic has set in and you're not looking at
> > the problem in the best way to solve it.
>
> Sorry Zed, I have a great deal of respect for your work and your
> opinions on development. But you seem to have a blind spot here and I
> just don't understand why.

That's because you're reading my recommendation as "performance tuning vs. design to avoid". If you've read any of my work you'll understand I *never* advocate a boolean argument. Those are for computers.

In my argument I'm saying that his problem can never be solved because he doesn't have control of the performance at all, and why should the user's HTTP REQUEST be held up for this? You get the distinction? Your HTTP request processing doesn't have to be coupled to your backend request processing. Break them apart and then you can ensure the user gets rapid feedback, you have fewer bottlenecks, you can push the processing out, and you can measure orthogonal pieces rather than one giant messy process.

> This has nothing to do with optimisation. It has nothing to do with
> performance. It's got everything to do with resilience and reliability.

No, resilience and reliability are quantifiable metrics: Mean Time Between Failure, to be exact. "Performance" is a subjective thing that's based on human perception. Yes, I know you can go get yourself a little graph of requests per second, but that won't tell you whether the users think it is fast. If you can't make the computer fast, trick the people into thinking it's fast.

<snip>

> Your philosophy guarantees that your application's performance will be
> held hostage by the worst performing action within it.

Again, no, I'm not saying don't try to make it fast.
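To make the "break them apart" point concrete, here's a minimal sketch in plain Ruby (no Rails, no real job queue; all the names here are invented for illustration): the web action only enqueues and hands back a token, the status action is a cheap lookup, and the slow work happens in a worker outside the request cycle.

```ruby
require 'securerandom'

JOBS    = Queue.new   # thread-safe FIFO between web and worker
RESULTS = {}          # token => result (a real app would use a shared store)
LOCK    = Mutex.new

# The web-facing action: enqueue and answer immediately with a token.
# (In a Rails app this would render a 202 plus the token.)
def enqueue_report(params)
  token = SecureRandom.hex(8)
  JOBS << [token, params]
  token
end

# The polling/status action: a fast lookup that never blocks on the
# slow backend work.
def report_status(token)
  LOCK.synchronize { RESULTS.fetch(token, :pending) }
end

# One unit of backend work, run outside the request cycle.
def process_one_job
  token, params = JOBS.pop                # blocks until a job arrives
  report = "report for #{params[:name]}"  # stand-in for the 10-second query
  LOCK.synchronize { RESULTS[token] = report }
end
```

In a real deployment the worker would be its own thread or process (e.g. `Thread.new { loop { process_one_job } }`), but the coupling is already gone: a 10-second job can no longer hold an HTTP request, or a mongrel, hostage.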
What I'm saying is that the first thing programmers do is run off with faulty statistics to "tune" their system, completely ignoring the fact that many times a simple redesign (or complex improvement) can just eliminate the problem entirely. See my most recent reply to Brian for many examples.

> What if I screw up and accidentally roll out a "bad" action? Should
> this mean that *every aspect* of my app now behaves terribly?
> Following your logic, it does. The whole point of a load balancer is
> that it should enable things to behave sensibly even if one of my
> backend servers is screwed up. But a mismatch between the expectations
> encoded within mod_proxy_balancer and Mongrel running Ruby on Rails
> means that this isn't the case.

Well, I didn't do a logic proof, so you're inventing logic where there is none. My "logic" would be this: the fastest way to do something is to just not do it. Right? That basically gives you an infinite number of requests per second. :-)

But ultimately, I've been doing this a long time, and the one thing I've realized is that no matter how fast you make something, there's always a bigger dumbass available to make it slow. Hell man, computers have blasted ahead in capability and speed over the years, and still I have to wait for my damn email to render in the fastest email client I could find. No amount of making things fast will protect you against stupidity.

> Similarly, if I write a quick and dirty reporting action which runs
> an SQL query which takes 10 seconds to complete, should that screw up
> my entire application? It seems unreasonable to me that I have to
> optimise an action like this (why should I care if a reporting action
> which is only used once a day takes 10 seconds to complete?). I *do*
> care if every time I run it, though, I cause all the 0.1 second
> actions to queue up behind it.

I'd reword this: "I have SQL queries that take 10 seconds to complete and I'm stuck using Mongrel because nobody else has stepped up to fix the dumbass crap in Ruby's GC, IO, and Threads, and even the JRuby guys can't solve their 'mystery' performance problem with Rails..."

Option A: "... I'm totally screwed and should toss myself off a building because I keep banging my head on this thing and it doesn't go faster."

Option B: "... I'm rich and will just put 1000 mongrels in the mix and solve the problem."

Option C: "... I know queueing theory and can work up a queuing model that will help me figure out the minimum number of request processors to handle the queue at a 10 req/sec rate."

Option D: "... I can analyze the performance of all my stuff and tune it as fast as possible, then try C and B."

Option E: "... Well, let's try some stuff on the front end and see if we can just trick people into thinking how this goes so that there isn't a problem anymore."

Any of them will work, but with Rails options E, D, C, and B work best (in that order). Please don't do A; it's not that big a deal.

Epilogue (not just for Paul): A lot of people complain that Rails should be thread safe. Well, Rails Core folks including DHH also complain that it should be thread safe. Under JRuby you can spin up a ton of real threads with entire Rails apps in each one, but that's suboptimal for memory usage (like Java cares). If all of you think that Rails shouldn't have a giant lock, then I have only one suggestion: get off your damn ass and make it happen. David just made a big effort to make the process for submitting patches much more open, and he's looking for people to solve this problem. I dare say he's admitted he was wrong about the locking issue and is ultra-keen (I won't say desperate) for someone to solve it.
Nothing is in your way, and the reward will be the glory of making things fast. Worked for me, and I can say it's totally worth it.

As a sweetener, I'll throw this out: I bet you can NOT make Rails threadsafe. The first person or group of people to finally get rid of the thread locking around Rails requests in Mongrel and make Rails performance match that of Merb or Nitro on average will get a real high-school-style trophy from me. The trophy will have a bust of a dog on it and will be inscribed: "Official Mongrel Rails Threadify Ninja Destroyer 1st Place: Zed and DHH were wrong!" The runner-up will get the first set of MUDCRAP-CE certificates, and I'll hand them out at the next Rails or Ruby conference in person.

Alright, I've ponied up my end of the bargain. Who's going to take me on?

--
Zed A. Shaw
- Hate: http://savingtheinternetwithhate.com/
- Good: http://www.zedshaw.com/
- Evil: http://yearofevil.com/
The query should not take 10 seconds. People should not steal. Still, they do, and I live with the workaround: locking. So, while the 10-second query is a problem, and worth solving for its own sake, the mod_proxy_balancer solution prevents it from causing the secondary request-queuing problem. That might eliminate enough crisis meetings that someone actually has time to fix the underlying problem without working through the weekend. Which, in turn, lessens the likelihood of anyone choosing option A.

> I'd reword this: "I have SQL queries that take 10 seconds to complete
> and I'm stuck using Mongrel because nobody else has stepped up to fix
> the dumbass crap in Ruby's GC, IO, and Threads and even the JRuby guys
> can't solve their 'mystery' performance problem with Rails..."
>
> Option A: "... I'm totally screwed and should toss myself off a
> building because I keep banging my head on this thing and it doesn't
> go faster."
>
> Option B: "... I'm rich and will just put 1000 mongrels in the mix and
> solve the problem."
>
> Option C: "... I know queueing theory and can work up a queuing model
> that will help me figure out the minimum number of request processors
> to handle the queue at a 10 req/sec rate."
>
> Option D: "... I can analyze the performance of all my stuff and tune
> it as fast as possible, then try C and B."
>
> Option E: "... Well, let's try some stuff on the front end and see if
> we can just trick people into thinking how this goes so that there
> isn't a problem anymore."
>
> Any of them will work, but with Rails option E, D, C, and B work best
> (in that order). Please don't do A, it's not that big a deal.
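For what it's worth, Option C is more tractable than it sounds. Here's a rough Ruby sketch that treats each mongrel as a server in an M/M/c queue and finds the smallest pool that keeps the average queueing delay under a target. The 10 req/sec arrival rate comes from the thread; the 0.2-second mean service time and the delay targets below are invented numbers for illustration.

```ruby
# Erlang C: the probability that an arriving request has to wait,
# given c servers and an offered load of a Erlangs (a = arrivals *
# mean service time). Computed without big factorials by expressing
# each a^k/k! term as a ratio to a^c/c!.
def erlang_c(servers, offered_load)
  return 1.0 if offered_load >= servers   # unstable: queue grows forever
  inv = (0...servers).sum do |k|
    ((k + 1)..servers).reduce(1.0) { |p, j| p * j / offered_load }
  end
  top = servers / (servers - offered_load)
  top / (inv + top)
end

# Smallest number of single-threaded request processors (mongrels)
# that keeps the mean time spent waiting in queue under max_wait.
# Mean queueing delay in M/M/c is Wq = C(c, a) / (c*mu - lambda).
def min_mongrels(arrival_rate, service_time, max_wait)
  a = arrival_rate * service_time          # offered load in Erlangs
  c = a.floor + 1                          # need c > a for stability
  loop do
    wq = erlang_c(c, a) / (c / service_time - arrival_rate)
    return c if wq <= max_wait
    c += 1
  end
end
```

With those made-up numbers, three mongrels keep the average wait under 100 ms at 10 req/sec, and tightening the target to 50 ms calls for a fourth. The model says nothing about the 10-second outlier itself, of course, which is exactly why it pairs with decoupling or fixing the query.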