Tom Preston-Werner
2009-Sep-18 04:54 UTC
502s with Nginx, Unicorn, and Unix Domain Sockets
I'm doing some benchmarking on our new Rackspace frontend machines (8 core, 16GB) and running into some problems with the Unix domain socket setup. At high request rates (on simple pages) I'm getting a lot of HTTP 502 errors from Nginx. Nothing shows up in the Unicorn error log, but Nginx has the following in its error log:

2009/09/17 19:36:52 [error] 28277#0: *524824 connect() to unix:/data/github/current/tmp/sockets/unicorn.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 172.17.1.5, server: github.com, request: "GET /site/junk HTTP/1.1", upstream: "http://unix:/data/github/current/tmp/sockets/unicorn.sock:/site/junk", host: "github.com"

This problem does not exist with the nginx -> haproxy -> unicorn setup. Thinking this might be a file descriptor problem, I upped the fd limit to 32768 with no luck. Then I tried upping net.core.somaxconn to 262144, which also had no effect. I thought I'd ask about the problem here to see if anyone knows a simple solution that I'm missing. Perhaps there is an Nginx configuration directive I need?

Thanks. Unicorn rocks!

Tom

--
Tom Preston-Werner
GitHub Cofounder
http://tom.preston-werner.com
github.com/mojombo
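(For reference, the nginx side of a setup like this is typically just an upstream block pointing at the Unicorn socket. The sketch below is a minimal assumption based on the socket path in the error log above, not the actual GitHub configuration; the upstream name and fail_timeout setting are illustrative.)

# Sketch of an nginx -> Unicorn unix-socket proxy; only the socket path
# comes from the error log above, everything else is illustrative.
upstream unicorn_app {
    # fail_timeout=0 is commonly used with a Unicorn upstream so nginx
    # keeps retrying the socket rather than marking it unavailable
    server unix:/data/github/current/tmp/sockets/unicorn.sock fail_timeout=0;
}

server {
    listen 80;
    server_name github.com;

    location / {
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_pass http://unicorn_app;
    }
}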
Tom Preston-Werner <tom at github.com> wrote:

> I'm doing some benchmarking on our new Rackspace frontend machines (8
> core, 16GB) and running into some problems with the Unix domain socket
> setup. At high request rates (on simple pages) I'm getting a lot of
> HTTP 502 errors from Nginx. Nothing shows up in the Unicorn error log,
> but Nginx has the following in its error log:

Hi Tom,

At what request rates were you running into this? Also, how large are your responses? It could be the listen() backlog overflowing if Unicorn isn't logging anything. Anything in the system/kernel logs (doubtful, actually)?

Does increasing the listen :backlog parameter work? Default is 1024 (which is pretty high already); maybe try a higher number along with the net.core.netdev_max_backlog sysctl.

Is there a large discrepancy between the times your benchmark client logs, the request time nginx logs, and whatever Rails/Rack logs for request times for any particular request? If the Rails/Rack times all seem consistently low but your nginx/benchmark numbers have some weird spikes/outliers, then some requests are stuck in the kernel listen backlog.

How much of the 8 cores is being used on those boxes when this starts happening?

> 2009/09/17 19:36:52 [error] 28277#0: *524824 connect() to
> unix:/data/github/current/tmp/sockets/unicorn.sock failed (11:
> Resource temporarily unavailable) while connecting to upstream,
> client: 172.17.1.5, server: github.com, request: "GET /site/junk
> HTTP/1.1", upstream:
> "http://unix:/data/github/current/tmp/sockets/unicorn.sock:/site/junk",
> host: "github.com"

Raising proxy_connect_timeout in nginx may be a workaround; what is it set to now? On the other hand, keeping it (and :backlog in Unicorn) low would give better indications for failover to other hosts.

> This problem does not exist with the nginx -> haproxy -> unicorn
> setup. Thinking this might be a file descriptor problem, I upped the
> fd limit to 32768 with no luck. Then I tried upping net.core.somaxconn
> to 262144 which also had no effect. I thought I'd ask about the
> problem here to see if anyone knows a simple solution that I'm
> missing. Perhaps there is an Nginx configuration directive I need?
> Thanks. Unicorn rocks!

Definitely not a file descriptor problem (at least not inside Unicorn). Also, I'm not sure there's a reason to keep haproxy between nginx and Unicorn... maybe haproxy in front of the entire cluster of servers.

Are you already hitting higher request rates (and more consistent times logged by client/nginx) with nginx -> unicorn/unix vs. nginx -> unicorn/tcp (localhost)?

Under extremely high loads, 502s may actually be wanted since they allow failover to a less loaded box if there's uneven balancing; but we really need to have numbers on the request rates.

--
Eric Wong
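(A minimal sketch of the proxy_connect_timeout workaround mentioned above, assuming an upstream block like the one sketched earlier. The timeout value is illustrative; nginx's default is 60s, and as noted, a low timeout fails over to other hosts sooner.)

# Workaround sketch: give nginx longer to complete connect() to a busy
# socket before it gives up and returns a 502. Illustrative value only.
location / {
    proxy_connect_timeout 75s;
    proxy_pass http://unicorn_app;  # hypothetical upstream name from the earlier sketch
}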
Hi Tom, any updates on this? I'd really like to get to the bottom of this, thanks!

--
Eric Wong
Tom Preston-Werner
2009-Sep-19 20:23 UTC
502s with Nginx, Unicorn, and Unix Domain Sockets
On Thu, Sep 17, 2009 at 11:48 PM, Eric Wong <normalperson at yhbt.net> wrote:

> At what request rates were you running into this? Also how large are
> your responses? It could be the listen() backlog overflowing if Unicorn
> isn't logging anything.

I was hitting the 502s at about 1300 req/sec and 80% CPU utilization. Response size was only a few bytes + headers. I was just testing a very simple string response from our Rails app to make sure our setup could tolerate very high request rates.

> Does increasing the listen :backlog parameter work? Default is 1024
> (which is pretty high already), maybe try a higher number along with the
> net.core.netdev_max_backlog sysctl.

This was the first thing I tried after getting your response, and it seems that upping the :backlog to 2048 solves the 502 problem! I'm now able to get 1500 req/sec out of Unicorn/UNIX (as opposed to 1350 req/sec with the TCP/HAProxy setup). I'm quite satisfied with this result, and I think this is how we'll end up deploying the app.

Thanks for your help, and I'll try to keep you updated on how our installation performs and if I see any strange behavior under normal traffic.

Tom
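(For reference, the change described above would look roughly like this in a Unicorn config file. Only the socket path and the 2048 backlog come from this thread; worker_processes and timeout are illustrative assumptions.)

# unicorn.rb sketch; values other than the socket path and :backlog are assumed
worker_processes 8

# :backlog is passed through to listen(2); the default of 1024 was
# overflowing at ~1300 req/sec, and raising it to 2048 stopped the 502s
listen "/data/github/current/tmp/sockets/unicorn.sock", :backlog => 2048

timeout 30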
Tom Preston-Werner <tom at github.com> wrote:

> On Thu, Sep 17, 2009 at 11:48 PM, Eric Wong <normalperson at yhbt.net> wrote:
> > At what request rates were you running into this? Also how large are
> > your responses? It could be the listen() backlog overflowing if Unicorn
> > isn't logging anything.
>
> I was hitting the 502s at about 1300 req/sec and 80% CPU utilization.
> Response size was only a few bytes + headers. I was just testing a
> very simple string response from our Rails app to make sure our setup
> could tolerate very high request rates.

Yup, as I suspected: your UNIX socket setup was maxing out right around where your TCP setup was maxing out. TCP is just better at handling/recovering from errors.

> > Does increasing the listen :backlog parameter work? Default is 1024
> > (which is pretty high already), maybe try a higher number along with the
> > net.core.netdev_max_backlog sysctl.
>
> This was the first thing I tried after getting your response, and it
> seems that upping the :backlog to 2048 solves the 502 problem! I'm now
> able to get 1500 req/sec out of Unicorn/UNIX (as opposed to 1350
> req/sec with the TCP/HAProxy setup). I'm quite satisfied with this
> result, and I think this is how we'll end up deploying the app.

Good to know it worked! However, I do hesitate to recommend a large listen() backlog for production. It can interfere with monitoring/failover/load-balancing in multi-server setups even if it looks good in benchmarks. I'll make a separate call-for-testing mailing list post related to this subject in a bit...

> Thanks for your help, and I'll try to keep you updated on how our
> installation performs and if I see any strange behavior under normal
> traffic.

No problem, thanks for the feedback! It's great to know people actually use it.

--
Eric Wong