Interested in some feedback on this (does it sound right?), or maybe this
might be of interest to others.

We are launching a new Facebook app in a couple weeks and we did some load
testing over the weekend on our Unicorn web cluster. The servers are 8-way
Xeons with 24GB RAM. Our app ended up being primarily CPU bound. So far the
sweet spot for the number of unicorns seems to be around 40. This seemed to
yield the most requests per second without overloading the server or hitting
memory bandwidth issues. The backlog is at the somaxconn default of 128; I'm
still not sure if we will bump that up or not.

Increasing the number of unicorns beyond a certain point resulted in a
noticeable drop in the requests per second the server could handle. I'm
pretty sure the cause is the box running out of memory bandwidth. The load
average and resource usage in general (except for memory) would keep going
down, but so did the requests per second. At 80 unicorns the requests per
second dropped by more than half. I'm going to disable hyperthreading and
rerun some of the tests to see what impact that has.

Chris
snacktime <snacktime at gmail.com> wrote:
> Interested in some feedback on this (does it sound right?), or maybe
> this might be of interest to others.

Hi Chris,

I think you meant to post this to the mongrel-unicorn at rubyforge.org
list, not mongrel-users at rubyforge.org :>

> We are launching a new Facebook app in a couple weeks and we did some
> load testing over the weekend on our Unicorn web cluster. The servers
> are 8-way Xeons with 24GB RAM. Our app ended up being primarily CPU
> bound. So far the sweet spot for the number of unicorns seems to be
> around 40. This seemed to yield the most requests per second without
> overloading the server or hitting memory bandwidth issues. The
> backlog is at the somaxconn default of 128; I'm still not sure if we
> will bump that up or not.

The default backlog we try to specify is actually 1024 (same as
Mongrel). But it's always a murky value anyway, as it's
kernel/sysctl-dependent. With Unix domain sockets, some folks use
crazy values like 2048 to look better on synthetic benchmarks :)

> Increasing the number of unicorns beyond a
> certain point resulted in a noticeable drop in the requests per second
> the server could handle. I'm pretty sure the cause is the box
> running out of memory bandwidth. The load average and resource usage
> in general (except for memory) would keep going down, but so did the
> requests per second. At 80 unicorns the requests per second dropped
> by more than half. I'm going to disable hyperthreading and rerun some
> of the tests to see what impact that has.

That's "8-way Xeon" _before_ hyperthreading, right? Which family of
Xeons are you using, the Pentium4-based crap or the awesome new ones?

How much memory is each Unicorn worker using for your app?

40 workers for 8 physical cores sounds reasonable. Depending on the
app, I think the reasonable range is anywhere from 2-8 workers per
physical core: more if you're (unfortunately) limited by external
network calls, but since you claim to be CPU bound, fewer.

Do you have actual performance numbers you're able to share?
Mean/median request times/rates would be very useful. If your requests
run very quickly, you may be limited by contention on the accept()
syscall on the listen socket, too.

I assume you're using nginx as the proxy; is this with Unix domain
sockets or TCP sockets? Unix domain sockets should give a small
performance advantage over TCP if it's all on the same box. With TCP,
you should also check that you have enough local ports available if
you're hitting extremely high (and probably unrealistic :) request
rates.

-- 
Eric Wong
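For concreteness, a minimal unicorn config along the lines discussed above
might look like the sketch below; the socket path, worker count, and backlog
value are illustrative assumptions, not Chris's actual settings.

    # config/unicorn.rb -- a minimal sketch, not anyone's production config.
    worker_processes 40      # roughly 5 workers per physical core, per the thread

    # Unix domain socket for an nginx on the same box. If :backlog is
    # omitted, Unicorn asks for 1024; the kernel may still cap the
    # effective value (net.core.somaxconn on Linux, 128 by default).
    listen "/tmp/unicorn.sock", :backlog => 1024

    timeout 30
    preload_app true

Note that :backlog only changes what Unicorn asks the kernel for; raising
net.core.somaxconn is a separate sysctl change.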
What was the request rate and total bandwidth flowing at your peak? How far
is that from your theoretical potential on the box?

On 21 Jun 2010, at 16:58, snacktime wrote:

> Interested in some feedback on this (does it sound right?), or maybe
> this might be of interest to others.
>
> We are launching a new Facebook app in a couple weeks and we did some
> load testing over the weekend on our Unicorn web cluster. The servers
> are 8-way Xeons with 24GB RAM. Our app ended up being primarily CPU
> bound. So far the sweet spot for the number of unicorns seems to be
> around 40. This seemed to yield the most requests per second without
> overloading the server or hitting memory bandwidth issues. The
> backlog is at the somaxconn default of 128; I'm still not sure if we
> will bump that up or not. Increasing the number of unicorns beyond a
> certain point resulted in a noticeable drop in the requests per second
> the server could handle. I'm pretty sure the cause is the box
> running out of memory bandwidth. The load average and resource usage
> in general (except for memory) would keep going down, but so did the
> requests per second. At 80 unicorns the requests per second dropped
> by more than half. I'm going to disable hyperthreading and rerun some
> of the tests to see what impact that has.
>
> Chris
On Jun 21, 2010, at 5:16 PM, Eric Wong wrote:

>> overloading the server or hitting memory bandwidth issues. The
>> backlog is at the somaxconn default of 128; I'm still not sure if we
>> will bump that up or not.
>
> The default backlog we try to specify is actually 1024 (same as
> Mongrel). But it's always a murky value anyway, as it's
> kernel/sysctl-dependent. With Unix domain sockets, some folks use
> crazy values like 2048 to look better on synthetic benchmarks :)

Somewhat related -- I've been meaning to discuss the finer points of
backlog tuning.

I've been experimenting with the multi-server socket+TCP megaunicorn
configuration from your CDT:
http://rubyforge.org/pipermail/mongrel-unicorn/2009-September/000033.html

Which I think is what this sentence from TUNING is talking about?

  "Setting a very low value for the :backlog parameter in 'listen'
  directives can allow failover to happen more quickly if your
  cluster is configured for it."

Our app can catch a batch of requests which will be slow (1-3s), and
these can pool on one individual server in our load-balanced EC2
cluster -- exactly the case for the multi-server failover setup.

I've put this into production under a healthy load (5000+ RPM) and it
appears to work really well! It produces very high requests/s rates at
significantly higher concurrency than without, and serves zero 502
errors (part of the goal).

I currently have the unix socket set to a backlog of 64, then failing
over to a TCP listener using backlog 1024 (so that things are queued
rather than 502'd).

I can imagine there might be a case for keeping the TCP backlog low as
well and serving errors when overloaded, rather than getting caught in
an unrecoverable back-queue tarpit.

I'm currently failing over to a dedicated "backup" instance, so that I
could measure exactly how much traffic is being offloaded. This means
my benchmarks w/o failover are 1 server, but with failover it's
actually 2 servers. We're reconfiguring to something more like the
original diagram, at which point I'll do some cluster-wide stress tests
and share data/scripts/process.

BTW, this configuration needs a cool name!

-jamie
http://jamiedubs.com
http://fffff.at
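A minimal sketch of the listener half of that setup, using the backlog
numbers Jamie mentions; the socket path and port are assumptions for
illustration, not his actual configuration.

    # Each app server's unicorn listens on both a Unix socket (short
    # backlog, preferred by the local nginx) and a TCP port (longer
    # backlog, used as the spillover/backup upstream).
    listen "/tmp/unicorn.sock", :backlog => 64
    listen "0.0.0.0:8080",      :backlog => 1024

Roughly, the local nginx lists the Unix socket first in its upstream block
and the TCP listeners (on a backup box, or on the other app servers) as
fallback servers, so a request only spills over to TCP once the short
socket backlog is full.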
Jamie Wilkinson <jamie at tramchase.com> wrote:
> On Jun 21, 2010, at 5:16 PM, Eric Wong wrote:
> >> overloading the server or hitting memory bandwidth issues. The
> >> backlog is at the somaxconn default of 128; I'm still not sure if
> >> we will bump that up or not.
> >
> > The default backlog we try to specify is actually 1024 (same as
> > Mongrel). But it's always a murky value anyway, as it's
> > kernel/sysctl-dependent. With Unix domain sockets, some folks use
> > crazy values like 2048 to look better on synthetic benchmarks :)
>
> Somewhat related -- I've been meaning to discuss the finer points of
> backlog tuning.
>
> I've been experimenting with the multi-server socket+TCP megaunicorn
> configuration from your CDT:
> http://rubyforge.org/pipermail/mongrel-unicorn/2009-September/000033.html
>
> Which I think is what this sentence from TUNING is talking about?
>
>   "Setting a very low value for the :backlog parameter in 'listen'
>   directives can allow failover to happen more quickly if your
>   cluster is configured for it."

Yes.

<snip>

Thanks for sharing, and good to hear this is working well for you.

I'm still unlikely to have the chance to test this anywhere soon, but
maybe more folks can give it a try now that we've had one successful
report. More reports (success or not) would definitely be good to hear.

> BTW, this configuration needs a cool name!

Since you're the first person brave enough to try (or at least report
about it), you shall have the honor of naming it :)

-- 
Eric Wong
On Mon, Jun 21, 2010 at 5:16 PM, Eric Wong <normalperson at yhbt.net> wrote:
> snacktime <snacktime at gmail.com> wrote:
>> Interested in some feedback on this (does it sound right?), or maybe
>> this might be of interest to others.
>
> Hi Chris,
>
> I think you meant to post this to the mongrel-unicorn at rubyforge.org
> list, not mongrel-users at rubyforge.org :>

Yes, not sure how that got mixed up...

> That's "8-way Xeon" _before_ hyperthreading, right? Which family of
> Xeons are you using, the Pentium4-based crap or the awesome new ones?

Two quad-core Nehalems in each server.

> How much memory is each Unicorn worker using for your app?

Undoubtedly this is lower than it will be under a real load, but under
our load tests they stabilize at around 160MB.

> Do you have actual performance numbers you're able to share?
> Mean/median request times/rates would be very useful. If your requests
> run very quickly, you may be limited by contention on the accept()
> syscall on the listen socket, too.

I had two different types of requests to test that I ran in varying
combinations. One takes on average 600ms, and the other 40ms. 98% of
our requests will be the faster one. Deviations were really low.

> I assume you're using nginx as the proxy; is this with Unix domain
> sockets or TCP sockets? Unix domain sockets should give a small
> performance advantage over TCP if it's all on the same box.

Yes, nginx with domain sockets.

Chris
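A back-of-envelope sketch from the numbers quoted above; it deliberately
ignores copy-on-write sharing, GC, hyperthreading, and the memory-bandwidth
ceiling Chris suspects is the real limit, so treat it as rough bounds only.

    # Rough arithmetic from the figures in this thread; purely illustrative.
    workers = 40
    rss_mb  = 160                            # ~160MB per worker under test load
    puts "approx. worker RSS total: #{workers * rss_mb} MB of 24576 MB"

    # Weighted mean service time: 98% of requests at 40ms, 2% at 600ms.
    mean_s = 0.98 * 0.040 + 0.02 * 0.600     # ~0.0512 s
    puts "ceiling if all 40 workers stay busy: ~#{(workers / mean_s).round} req/s"
    puts "ceiling if truly bound by 16 HW threads: ~#{(16 / mean_s).round} req/s"

That puts worker memory around 6.4GB of the 24GB, which is consistent with
the report that memory itself was not the bottleneck.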
>> Somewhat related -- I've been meaning to discuss the finer points of
>> backlog tuning.
>>
>> I've been experimenting with the multi-server socket+TCP megaunicorn
>> configuration from your CDT:
>> http://rubyforge.org/pipermail/mongrel-unicorn/2009-September/000033.html

So I'm in the position of launching a web app in a couple of weeks that
is pretty much guaranteed to get huge traffic. I'm working with ops
people who are very good, but this is not how they would normally set
up load balancing and scale out. I'm having a meeting with our network
ops lead tomorrow to talk about this.

I like the idea of this approach; it seems like it gives you more
fine-grained control over how much load you put on individual servers,
as well as how individual requests are handled. But I'm not too keen on
using something like this at scale when we simply don't have the chance
to test it out at a smaller scale. I have yet to see anyone with this
setup running at scale. That of course doesn't mean it's not a great
idea, only that I doubt our ops guys are going to want to be the first.
They are already overworked as it is :)

So assuming we will scale out the 'normal' way by not having a short
backlog, any info on how to manage that? Should we control the backlog
queue in nginx (not sure exactly how I would do that) or via the listen
backlog? I was looking around last night and couldn't find a way to
actually poll the listen backlog queue size.

Also, any ideas on how you would practically manage this type of load
balancing setup? It seems like you would have some type of 'reserve'
cluster for requests that hit the listen backlog, and when you start
seeing too much traffic going to the reserve, you add more servers to
your main pool. How else would you manage the configuration for
something like this when you are working with 100-200 servers? You
can't be changing the nginx configs every time you add servers; that's
just not practical.

Chris
>> Somewhat related -- I've been meaning to discuss the finer points of
>> backlog tuning.
>>
>> I've been experimenting with the multi-server socket+TCP megaunicorn
>> configuration from your CDT:
>> http://rubyforge.org/pipermail/mongrel-unicorn/2009-September/000033.html

On Jun 22, 2010, at 11:03 AM, snacktime wrote:
> It seems like you would have some type of 'reserve'
> cluster for requests that hit the listen backlog, and when you start
> seeing too much traffic going to the reserve, you add more servers to
> your main pool. How else would you manage the configuration for
> something like this when you are working with 100-200 servers? You
> can't be changing the nginx configs every time you add servers; that's
> just not practical.

We are using chef for machine configuration, which makes these kinds of
numbers doable: http://wiki.opscode.com/display/chef/Home

I would love to see an nginx module for distributed configuration
management.

Right now we are running 6 frontend machines, 4 in use and 2 in reserve
like you described. We are doing about 5000rpm with this, almost all
dynamic. 10-30% of requests might be 'slow' (1+s) depending on usage
patterns.

To measure health I am using munin to watch system load, nginx requests
and nginx errors. In this configuration, 502 Bad Gateways from the
frontend nginx indicate a busy unicorn socket and thus a handoff of the
request to the backups. Then we measure the Rails production.log for
request counts and speed on each server, as well as using NewRelic RPM.

monit also emails us when 502s show up. In theory monit could
automatically spin up another backup server, provision it using chef,
then reprovision the rest of the cluster to start handing over traffic.
Alternately, the new server could just act as a backup for the one
overloaded machine, which could make isolating performance issues
easier.

-jamie
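As a rough illustration of that 502 health signal, a tiny script along
these lines could count upstream 502s for munin or monit to graph or alert
on; the log path and the combined log format are assumptions here, not
Jamie's actual setup.

    #!/usr/bin/env ruby
    # Count HTTP 502 responses in an nginx access log (combined format).
    # A sketch only: adjust the path and field position to your own logs.
    log = ARGV[0] || "/var/log/nginx/access.log"

    count = File.foreach(log).count do |line|
      # In the default combined format the status code is the 9th field.
      line.split(" ")[8] == "502"
    end

    puts "502_count.value #{count}"   # munin-style plugin output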
On Jun 21, 2010, at 9:53 PM, Eric Wong wrote:
> Thanks for sharing, and good to hear this is working well for you.
>
> I'm still unlikely to have the chance to test this anywhere soon, but
> maybe more folks can give it a try now that we've had one successful
> report. More reports (success or not) would definitely be good
> to hear.
>
>> BTW, this configuration needs a cool name!
>
> Since you're the first person brave enough to try (or at least report
> about it), you shall have the honor of naming it :)

The all-knowing WikiAnswers says "a group of unicorns is a blessing" :)
http://wiki.answers.com/Q/What_is_a_group_of_Unicorns_called

Some great fan art out there:
http://www.elfwood.com/~ara-tun/Unicorn-Herd.2537340.html

But my coworkers and I are voting "pegacorn":
http://images.elfwood.com/art/m/i/michelle16/pegacorn.jpg

-jamie
snacktime <snacktime at gmail.com> wrote:
> >> Somewhat related -- I've been meaning to discuss the finer points of
> >> backlog tuning.
> >>
> >> I've been experimenting with the multi-server socket+TCP megaunicorn
> >> configuration from your CDT:
> >> http://rubyforge.org/pipermail/mongrel-unicorn/2009-September/000033.html
>
> So I'm in the position of launching a web app in a couple of weeks
> that is pretty much guaranteed to get huge traffic. I'm working with
> ops people who are very good, but this is not how they would normally
> set up load balancing and scale out. I'm having a meeting with our
> network ops lead tomorrow to talk about this. I like the idea of this
> approach; it seems like it gives you more fine-grained control over
> how much load you put on individual servers as well as how individual
> requests are handled. But I'm not too keen on using something like
> this at scale when we simply don't have the chance to test it out at a
> smaller scale. I have yet to see anyone with this setup running at
> scale. That of course doesn't mean it's not a great idea, only that I
> doubt our ops guys are going to want to be the first. They are
> already overworked as it is :)

No worries. Don't ever feel obligated to try something you're not
comfortable with. Heck, it took months before anybody besides myself
was comfortable with Unicorn.

> So assuming we will scale out the 'normal' way by not having a short
> backlog, any info on how to manage that? Should we control the
> backlog queue in nginx (not sure exactly how I would do that) or via
> the listen backlog? I was looking around last night and couldn't find
> a way to actually poll the listen backlog queue size.

nginx lets you specify backlog=num with the "listen" directive, much
like Unicorn does (Unicorn steals most configuration parameter
names/options from nginx):

http://wiki.nginx.org/NginxHttpCoreModule#listen

If you use Linux, you can poll the current listen queue using Raindrops
(http://raindrops.bogomips.org/), the ss(8) utility, or by parsing
/proc/net/tcp and/or /proc/net/unix. Unfortunately, checking the listen
queue for Unix domain sockets is expensive: Raindrops and ss(8) both
need to parse /proc/net/unix because that info isn't available via
netlink.

> Also, any ideas on how you would practically manage this type of load
> balancing setup? It seems like you would have some type of 'reserve'
> cluster for requests that hit the listen backlog, and when you start
> seeing too much traffic going to the reserve, you add more servers to
> your main pool. How else would you manage the configuration for
> something like this when you are working with 100-200 servers? You
> can't be changing the nginx configs every time you add servers; that's
> just not practical.

I've never tried this setup, so what Jamie said :)

One extra note: 100-200 hosts in an upstream {} block makes a very long
nginx config file. You could use ERB or something else to template it,
but based on a previous reading of the nginx source code, you can also
set up a round-robin DNS entry for all the servers. nginx only does DNS
lookups for upstreams at load time. For round-robin DNS entries, nginx
adds an entry for every IP address a name resolves to, so just specify
the one DNS name in the upstream block instead of the list of IPs.
Just remember to HUP the nginxes (or if you're forgetful, make an
occasional cronjob to HUP them) when you make DNS changes and
add/remove a box.

-- 
Eric Wong
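For reference, polling listen queues with Raindrops on Linux looks roughly
like the sketch below; the listener addresses are illustrative assumptions.

    require "raindrops"

    # TCP listeners: stats come via netlink on Linux, so this is cheap.
    Raindrops::Linux.tcp_listener_stats(["127.0.0.1:8080"]).each do |addr, stats|
      puts "#{addr}: active=#{stats.active} queued=#{stats.queued}"
    end

    # Unix domain sockets: Raindrops has to scan /proc/net/unix, which is
    # the more expensive case mentioned above.
    Raindrops::Linux.unix_listener_stats(["/tmp/unicorn.sock"]).each do |path, stats|
      puts "#{path}: active=#{stats.active} queued=#{stats.queued}"
    end

For a quick manual check on TCP listeners, ss(8) in listening mode also
shows the current accept queue alongside the configured backlog.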