yossarian1 at gmail.com
2007-Jul-19 02:22 UTC
[Mongrel] one mongrel with hundreds of CLOSE_WAIT tcp connections
Hi, I'm running into a strange issue where one mongrel will sometimes develop hundreds of CLOSE_WAIT TCP connections, mostly to apache (I think -- see sample lsof output below). I haven't had a chance to get the mongrel with this behavior into USR1 debug mode yet. I didn't catch it in time. This happens a couple times a day on average at seemingly random times. The problem goes away within a minute or two, probably after a restart of the mongrel.

I'm probably doing something crazy to cause this behavior, but I'm having trouble figuring out exactly what the problem is. It probably has to do with the fact that my mongrels get files off of amazon s3 for some requests. We do HTTPClient.get(url) for some s3 urls. I'm setting up dnsmasq now, by the way, but it's not up yet.

My next steps are to get the mongrel into USR1 debugging mode and to see what actions are causing the problem, and to install dnsmasq and cacti. I think I've got a good guess which action is responsible -- it's probably the one that gets the files from s3, but I'll make sure.

If you have any thoughts or other ideas, please let me know. Thanks a ton for your help!
Some sample output from lsof:

lsof -i -P | grep CLOSE_ | grep mongrel
CLOSE_WAIT --mysite
mongrel_r 831 root  6u IPv4 95162945 TCP localhost.localdomain:8011->localhost.localdomain:59311 (CLOSE_WAIT)
mongrel_r 831 root  9u IPv4 95161753 TCP mysite.com:49269->xxx-xxx-xxx-xxx.amazon.com:80 (CLOSE_WAIT)
mongrel_r 831 root 11u IPv4 95162093 TCP mysite.com:49339->xxx-xxx-xxx-xxx.amazon.com:80 (CLOSE_WAIT)
mongrel_r 831 root 14u IPv4 95162202 TCP mysite.com:49373->xxx-xxx-xxx-xxx.amazon.com:80 (CLOSE_WAIT)
mongrel_r 831 root 15u IPv4 95162229 TCP mysite.com:49380->xxx-xxx-xxx-xxx.amazon.com:80 (CLOSE_WAIT)
mongrel_r 831 root 16u IPv4 95162319 TCP mysite.com:49399->xxx-xxx-xxx-xxx.amazon.com:80 (CLOSE_WAIT)
mongrel_r 831 root 17u IPv4 95162477 TCP mysite.com:49436->xxx-xxx-xxx-xxx.amazon.com:80 (CLOSE_WAIT)
mongrel_r 831 root 19u IPv4 95163082 TCP localhost.localdomain:8011->localhost.localdomain:59348 (CLOSE_WAIT)
mongrel_r 831 root 20u IPv4 95163221 TCP localhost.localdomain:8011->localhost.localdomain:59387 (CLOSE_WAIT)
mongrel_r 831 root 21u IPv4 95163360 TCP localhost.localdomain:8011->localhost.localdomain:59426 (CLOSE_WAIT)
mongrel_r 831 root 22u IPv4 95161592 TCP mysite.com:49227->xxx-xxx-xxx-xxx.amazon.com:80 (CLOSE_WAIT)
mongrel_r 831 root 23u IPv4 95163507 TCP localhost.localdomain:8011->localhost.localdomain:59463 (CLOSE_WAIT)
mongrel_r 831 root 24u IPv4 95163675 TCP localhost.localdomain:8011->localhost.localdomain:59495 (CLOSE_WAIT)
mongrel_r 831 root 25u IPv4 95164041 TCP localhost.localdomain:8011->localhost.localdomain:59586 (CLOSE_WAIT)
mongrel_r 831 root 26u IPv4 95164181 TCP localhost.localdomain:8011->localhost.localdomain:59618 (CLOSE_WAIT)
mongrel_r 831 root 27u IPv4 95164293 TCP localhost.localdomain:8011->localhost.localdomain:59641 (CLOSE_WAIT)
mongrel_r 831 root 28u IPv4 95164441 TCP localhost.localdomain:8011->localhost.localdomain:59670 (CLOSE_WAIT)
mongrel_r 831 root 29u IPv4 95164607 TCP localhost.localdomain:8011->localhost.localdomain:59705 (CLOSE_WAIT)
mongrel_r 831 root 30u IPv4 95164748 TCP localhost.localdomain:8011->localhost.localdomain:59746 (CLOSE_WAIT)
mongrel_r 831 root 31u IPv4 95164895 TCP localhost.localdomain:8011->localhost.localdomain:59786 (CLOSE_WAIT)
mongrel_r 831 root 32u IPv4 95165064 TCP localhost.localdomain:8011->localhost.localdomain:59830 (CLOSE_WAIT)

etc. This goes on for 700 lines, where the mongrel on port 8011 has roughly 700 CLOSE_WAIT TCP connections to the 30-60k port range (to apache, I believe). All of these close_waits are for the mongrel on port 8011, in this case. Also, any ideas what's going on with the close_wait connections to amazon s3?

lsof -i -P | grep CLOSE_ | grep mongrel | wc -l
703

netstat | grep 56586 # an example port
tcp 1 0 localhost.localdomain:8011 localhost.localdomain:56586 CLOSE_WAIT
tcp 0 0 localhost.localdomain:56586 localhost.localdomain:8011 FIN_WAIT2
getnameinfo failed
getnameinfo failed
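To separate the two kinds of CLOSE_WAIT sockets seen in the lsof output (connections accepted on the mongrel's own port, versus outbound connections such as the ones to amazon s3), a small Ruby helper along these lines might help. This is only a sketch: the function name `close_wait_summary` and the `listen_port` parameter are made up for illustration, and it assumes the lsof line layout shown above, where the last field is the TCP state and the one before it is `local->remote`.

```ruby
# Count CLOSE_WAIT sockets per PID from `lsof -i -P` text, split into
# :inbound (local side is the mongrel's listening port) and :outbound
# (anything else, e.g. connections the mongrel opened to s3).
# `listen_port` is an assumption supplied by the caller (8011 in this thread).
def close_wait_summary(lsof_text, listen_port)
  summary = Hash.new { |hash, pid| hash[pid] = { inbound: 0, outbound: 0 } }

  lsof_text.each_line do |line|
    fields = line.split
    # Only socket rows whose final column is the CLOSE_WAIT state.
    next unless fields.length >= 3 && fields.last == "(CLOSE_WAIT)"

    pid = fields[1].to_i
    # The NAME column looks like "local_host:port->remote_host:port".
    local, _remote = fields[-2].split("->", 2)
    next if local.nil?

    local_port = local[/:(\d+)\z/, 1]
    kind = local_port == listen_port.to_s ? :inbound : :outbound
    summary[pid][kind] += 1
  end

  summary
end
```

For example, `close_wait_summary(`lsof -i -P`, 8011)` would return a hash keyed by PID, which makes it easier to see whether the pile-up is on the apache side, the s3 side, or both.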
yossarian1 at gmail.com
2007-Jul-21 07:15 UTC
[Mongrel] one mongrel with hundreds of CLOSE_WAIT tcp connections
In case anyone else runs into this problem, one brute-force solution is here:

http://poocs.net/2006/3/27/the-adventures-of-scaling-stage-3

The poocs solution is for fcgi, but it works for mongrels too. I actually wrote a similar loop myself that just uses lsof to look for a certain number of mongrel close_waits and then restarts the bad mongrel(s). I wrote this before I ran into the poocs.net solution. Ping me if you're interested in seeing it.
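The loop itself was not posted, so what follows is not the author's script, only a minimal Ruby sketch of the same idea: sample lsof, count CLOSE_WAIT sockets per mongrel PID, and hand any PID over a threshold to a restart action. The names `stuck_mongrel_pids` and `check_once`, the threshold, and the polling interval are all assumptions, and the restart step is deliberately left to the caller because it depends on how the mongrels are managed.

```ruby
# Return the PIDs of mongrel processes that hold at least `threshold`
# sockets in CLOSE_WAIT, given the text output of `lsof -i -P`.
def stuck_mongrel_pids(lsof_text, threshold)
  counts = Hash.new(0)

  lsof_text.each_line do |line|
    fields = line.split
    next unless fields.length >= 3
    # lsof truncates the command name, so match on the "mongrel" prefix.
    next unless fields.first.start_with?("mongrel") && fields.last == "(CLOSE_WAIT)"

    counts[fields[1].to_i] += 1
  end

  counts.select { |_pid, count| count >= threshold }.keys.sort
end

# One watchdog pass: run lsof and yield each offending PID so the caller
# can decide how to restart it (hypothetical hook, not a Mongrel API).
def check_once(threshold)
  output = `lsof -i -P`
  stuck_mongrel_pids(output, threshold).each { |pid| yield pid }
end

# Possible usage (threshold of 100 and 60-second interval are guesses):
#
#   loop do
#     check_once(100) { |pid| warn "mongrel #{pid} has too many CLOSE_WAITs" }
#     sleep 60
#   end
```

Like the poocs.net approach, this only treats the symptom; it does not explain why the sockets are left in CLOSE_WAIT in the first place.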