We've been running unicorn-3.6.2 on REE 1.8.7 2011.12 in production
for quite some time and we use monit to monitor each unicorn worker.
Occasionally, I'll get a notification that a worker has timed out and
been re-spawned. In all these cases, when I look at the rails logs, I
can see the last request that the worker handled, and they all appear
to have completed successfully from the client's perspective (rails
and nginx respond with 200), but the unicorn log shows that it was
killed due to timeout. This has always been relatively rare and I
thought it was a non-problem. Until today.

Today, for about a 7 minute period, our workers would continually
report as having timed out and would be killed by the master. After
re-spawning, the workers would serve a handful of requests and then
eventually be killed again. During this time, our servers (web, PG DB,
and redis) were not under load and IO was normal. After the last monit
notification at 8:30, everything went back to normal.

I understand why unicorns would time out if they were waiting (>120
secs) on IO, but there aren't any orphaned requests in the rails log.
For each request line, there's a corresponding completion line. No
long-running queries to blame on PG, either.

I know we're probably due for an upgrade, but I'm hoping to get to the
bottom of these unexplained timeouts.

Thanks for your help!

-Nick
nick@auger.net wrote:
> We've been running unicorn-3.6.2 on REE 1.8.7 2011.12 in production
> for quite some time and we use monit to monitor each unicorn worker.
> Occasionally, I'll get a notification that a worker has timed out and
> been re-spawned. In all these cases, when I look at the rails logs, I
> can see the last request that the worker handled, and they all appear
> to have completed successfully from the client's perspective (rails
> and nginx respond with 200), but the unicorn log shows that it was
> killed due to timeout. This has always been relatively rare and I
> thought it was a non-problem.
>
> During this time, our servers (web, PG DB, and redis) were not under
> load and IO was normal. After the last monit notification at 8:30,
> everything went back to normal. I understand why unicorns would time
> out if they were waiting (>120 secs) on IO, but there aren't any
> orphaned requests in the rails log. For each request line, there's a
> corresponding completion line. No long-running queries to blame on
> PG, either.

Can you take a look at the nginx error and access logs? From what
you're saying, there's a chance a request never even got to the Rails
layer. However, nginx should be logging failed/long-running requests
to unicorn.
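If it helps, something along these lines should surface the relevant
entries. The log paths are just the usual defaults; adjust them to
wherever your nginx actually writes.

  # upstream trouble as seen from nginx (log paths are examples)
  grep 'upstream' /var/log/nginx/error.log | tail -n 50

  # client-abandoned requests show up as status 499 in the access log
  grep ' 499 ' /var/log/nginx/access.log | tail -n 50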
"Eric Wong" <normalperson@yhbt.net> said:> Can you take a look at the nginx error and access logs? From what > you''re saying, there''s a chance a request never even got to the Rails > layer. However, nginx should be logging failed/long-running requests to > unicorn.The nginx access logs show frequent 499 responses. The error logs are filled with: connect() failed (110: Connection timed out) while connecting to upstream upstream timed out (110: Connection timed out) while reading response header from upstream What specific pieces of information should I be looking for in the logs? Thanks Again, -Nick _______________________________________________ Unicorn mailing list - mongrel-unicorn@rubyforge.org http://rubyforge.org/mailman/listinfo/mongrel-unicorn Do not quote signatures (like this one) or top post when replying
nick@auger.net wrote:> "Eric Wong" <normalperson@yhbt.net> said: > > Can you take a look at the nginx error and access logs? From what > > you''re saying, there''s a chance a request never even got to the Rails > > layer. However, nginx should be logging failed/long-running requests to > > unicorn. > > The nginx access logs show frequent 499 responses. The error logs are filled with: > > connect() failed (110: Connection timed out) while connecting to upstream > upstream timed out (110: Connection timed out) while reading response header from upstream > > What specific pieces of information should I be looking for in the logs?Do you have any other requests in your logs which could be taking a long time and hogging workers, but not high enough to trigger the unicorn kill timeout. (enable $request_time in nginx access logs if you haven''t already) Is this with Unix or TCP sockets? If it''s over a LAN, maybe there''s still a bad switch/port/cable somewhere (that happens often to me). With Unix sockets, I don''t recall encountering recent problems under Linux. Which OS are you running? I hit several TCP bugs in the early 3.7 Linux kernels, too... _______________________________________________ Unicorn mailing list - mongrel-unicorn@rubyforge.org http://rubyforge.org/mailman/listinfo/mongrel-unicorn Do not quote signatures (like this one) or top post when replying
"Eric Wong" <normalperson@yhbt.net> said:> nick@auger.net wrote: >> "Eric Wong" <normalperson@yhbt.net> said: >> > Can you take a look at the nginx error and access logs? From what >> > you''re saying, there''s a chance a request never even got to the Rails >> > layer. However, nginx should be logging failed/long-running requests to >> > unicorn. >> >> The nginx access logs show frequent 499 responses. The error logs are filled >> with: >> >> connect() failed (110: Connection timed out) while connecting to upstream >> upstream timed out (110: Connection timed out) while reading response header from >> upstream >> >> What specific pieces of information should I be looking for in the logs? > > Do you have any other requests in your logs which could be taking > a long time and hogging workers, but not high enough to trigger the > unicorn kill timeout.I don''t *think* so. Most requests finish <300ms. We do have some more intensive code-paths, but they''re administrative and called much less frequently. Most of these pages complete in <3seconds. For requests that made it to rails logging, the LAST processed request before the worker timed-out all completed very quickly (and no real pattern in terms of which page may be triggering it.)> (enable $request_time in nginx access logs if you haven''t already)I''ll enable this.> Is this with Unix or TCP sockets? If it''s over a LAN, maybe there''s > still a bad switch/port/cable somewhere (that happens often to me).TCP sockets, with nginx and unicorn running on the same box.> With Unix sockets, I don''t recall encountering recent problems under > Linux. Which OS are you running?Stock RHEL 5, kernel 2.6.18. Thanks again, -Nick _______________________________________________ Unicorn mailing list - mongrel-unicorn@rubyforge.org http://rubyforge.org/mailman/listinfo/mongrel-unicorn Do not quote signatures (like this one) or top post when replying
nick@auger.net wrote:> "Eric Wong" <normalperson@yhbt.net> said: > > > > Do you have any other requests in your logs which could be taking > > a long time and hogging workers, but not high enough to trigger the > > unicorn kill timeout.> I don''t *think* so. Most requests finish <300ms. We do have some > more intensive code-paths, but they''re administrative and called much > less frequently. Most of these pages complete in <3seconds. > > For requests that made it to rails logging, the LAST processed request > before the worker timed-out all completed very quickly (and no real > pattern in terms of which page may be triggering it.)This is really strange. This was only really bad for a 7s period? Has it happened again? Anything else going on with the system at that time? Swapping, particularly... And if you''re inside a VM, maybe your neighbors were hogging things. Large PUT/POST requests which require filesystem I/O are particularly sensitive to this.> > Is this with Unix or TCP sockets? If it''s over a LAN, maybe there''s > > still a bad switch/port/cable somewhere (that happens often to me). > > TCP sockets, with nginx and unicorn running on the same box.OK, that probably rules out a bunch of problems. Just to be thorough, anything interesting in dmesg or syslogs?> > With Unix sockets, I don''t recall encountering recent problems under > > Linux. Which OS are you running? > > Stock RHEL 5, kernel 2.6.18.RHEL 5.0 or 5.x? I can''t remember /that/ far back to 5.0 (I don''t think I even tried it until 5.2), but don''t recall anything being obviously broken in those... _______________________________________________ Unicorn mailing list - mongrel-unicorn@rubyforge.org http://rubyforge.org/mailman/listinfo/mongrel-unicorn Do not quote signatures (like this one) or top post when replying
"Eric Wong" <normalperson@yhbt.net> said:> nick@auger.net wrote: >> "Eric Wong" <normalperson@yhbt.net> said: >> > >> > Do you have any other requests in your logs which could be taking >> > a long time and hogging workers, but not high enough to trigger the >> > unicorn kill timeout. > >> I don''t *think* so. Most requests finish <300ms. We do have some >> more intensive code-paths, but they''re administrative and called much >> less frequently. Most of these pages complete in <3seconds. >> >> For requests that made it to rails logging, the LAST processed request >> before the worker timed-out all completed very quickly (and no real >> pattern in terms of which page may be triggering it.) > > This is really strange. This was only really bad for a 7s period?It was a 7 minute period. All of the workers would become busy and exceed their >120s timeout. Master would kill and re-spawn them, they''d start to respond to a handful of requests (anywhere from 5-50) after which they''d become "busy" again, and get force killed by master. This pattern happened 3 times over a 7 minute period.> Has it happened again?No> Anything else going on with the system at that time? Swapping, particularly...No swap activity or high load. Our munin graphs indicate a peak of web/app server disk latency around that time, although our graphs show many other similar peaks, without incident.> And if you''re inside a VM, maybe your neighbors were hogging things. > Large PUT/POST requests which require filesystem I/O are particularly > sensitive to this.We can''t blame virtualization as we''re on dedicated hardware.>> > Is this with Unix or TCP sockets? If it''s over a LAN, maybe there''s >> > still a bad switch/port/cable somewhere (that happens often to me). >> >> TCP sockets, with nginx and unicorn running on the same box. > > OK, that probably rules out a bunch of problems. > Just to be thorough, anything interesting in dmesg or syslogs?Nothing interesting in /var/log/messages. dmesg (no timestamps) includes lines like: TCP: Treason uncloaked! Peer ###.###.###.###:63554/80 shrinks window 1149440448:1149447132. Repaired.>> > With Unix sockets, I don''t recall encountering recent problems under >> > Linux. Which OS are you running? >> >> Stock RHEL 5, kernel 2.6.18. > > RHEL 5.0 or 5.x? I can''t remember /that/ far back to 5.0 (I don''t think > I even tried it until 5.2), but don''t recall anything being obviously > broken in those...We''re on RHEL 5.6. Thanks, -Nick _______________________________________________ Unicorn mailing list - mongrel-unicorn@rubyforge.org http://rubyforge.org/mailman/listinfo/mongrel-unicorn Do not quote signatures (like this one) or top post when replying
nick@auger.net wrote:> "Eric Wong" <normalperson@yhbt.net> said: > > This is really strange. This was only really bad for a 7s period? > > It was a 7 minute period. All of the workers would become busy and > exceed their >120s timeout. Master would kill and re-spawn them, > they''d start to respond to a handful of requests (anywhere from 5-50) > after which they''d become "busy" again, and get force killed by > master. This pattern happened 3 times over a 7 minute period. > > > Has it happened again? > > No > > > Anything else going on with the system at that time? Swapping, > > particularly... > > No swap activity or high load. Our munin graphs indicate a peak of > web/app server disk latency around that time, although our graphs show > many other similar peaks, without incident.I''m stumped :< Do you have any background threads running that could be hanging the workers? This is Ruby 1.8, after all, so there''s more likely to be some blocking call hanging the entire process. AFAIK, some monitoring software runs a background thread in the unicorn worker and maybe the OpenSSL extension doesn''t work as well if it encountered network problems under Ruby 1.8 Otherwise, I don''t know... _______________________________________________ Unicorn mailing list - mongrel-unicorn@rubyforge.org http://rubyforge.org/mailman/listinfo/mongrel-unicorn Do not quote signatures (like this one) or top post when replying
"Eric Wong" <normalperson@yhbt.net> said:> nick@auger.net wrote: >> "Eric Wong" <normalperson@yhbt.net> said: >> > This is really strange. This was only really bad for a 7s period? >> >> It was a 7 minute period. All of the workers would become busy and >> exceed their >120s timeout. Master would kill and re-spawn them, >> they''d start to respond to a handful of requests (anywhere from 5-50) >> after which they''d become "busy" again, and get force killed by >> master. This pattern happened 3 times over a 7 minute period. >> >> > Has it happened again? >> >> No >> >> > Anything else going on with the system at that time? Swapping, >> > particularly... >> >> No swap activity or high load. Our munin graphs indicate a peak of >> web/app server disk latency around that time, although our graphs show >> many other similar peaks, without incident. > > I''m stumped :<I was afraid you''d say that :(.> Do you have any background threads running that could be hanging the > workers? This is Ruby 1.8, after all, so there''s more likely to be > some blocking call hanging the entire process. AFAIK, some monitoring > software runs a background thread in the unicorn worker and maybe the > OpenSSL extension doesn''t work as well if it encountered network > problems under Ruby 1.8We don''t explicitly create any threads in our rails code. We do communicate with backgroundrb worker processes, although, none of the strangeness today involved any routes that would hit backgroundrb workers. Is there any instrumentation that I could add that might help debugging in the future? ($request_time and $upstream_response_time are now in my nginx logs.) We have noticed these "unexplainable timeouts" before, but typically for a single worker. If there''s some debugging that could be added I might be able to track it down during these one-off events. Thoughts? I absolutely appreciate all your help! -Nick _______________________________________________ Unicorn mailing list - mongrel-unicorn@rubyforge.org http://rubyforge.org/mailman/listinfo/mongrel-unicorn Do not quote signatures (like this one) or top post when replying
nick@auger.net wrote:> "Eric Wong" <normalperson@yhbt.net> said: > > nick@auger.net wrote: > >> "Eric Wong" <normalperson@yhbt.net> said: > > I''m stumped :< > > I was afraid you''d say that :(.Actually, another potential issue is DNS lookups timing out. But they shouldn''t take *that* long...> > Do you have any background threads running that could be hanging the > > workers? This is Ruby 1.8, after all, so there''s more likely to be > > some blocking call hanging the entire process. AFAIK, some monitoring > > software runs a background thread in the unicorn worker and maybe the > > OpenSSL extension doesn''t work as well if it encountered network > > problems under Ruby 1.8 > > We don''t explicitly create any threads in our rails code. We do > communicate with backgroundrb worker processes, although, none of the > strangeness today involved any routes that would hit backgroundrb > workers.I proactively audit every piece of code (including external libraries/gems) loaded by an app for potentially blocking calls (hits to the filesystem, socket calls w/o timeout/blocking). I use strace to help me find that sometimes...> Is there any instrumentation that I could add that might help > debugging in the future? ($request_time and $upstream_response_time > are now in my nginx logs.) We have noticed these "unexplainable > timeouts" before, but typically for a single worker. If there''s some > debugging that could be added I might be able to track it down during > these one-off events.As an experiment, can you replay traffic a few minutes leading up to and including that 7m period in a test setup with only one straced worker? Run "strace -T -f -o $FILE -p $PID_OF_WORKER" and see if there''s any unexpected/surprising dependencies (connect() to unrecognized addresses, open() to networked filesystems, fcntl locks, etc...). You can play around with some other strace options (-v/-s SIZE/-e filters) Maybe you''ll find something, there. _______________________________________________ Unicorn mailing list - mongrel-unicorn@rubyforge.org http://rubyforge.org/mailman/listinfo/mongrel-unicorn Do not quote signatures (like this one) or top post when replying
"Eric Wong" <normalperson@yhbt.net> said:> nick@auger.net wrote: >> "Eric Wong" <normalperson@yhbt.net> said: >> > nick@auger.net wrote: >> >> "Eric Wong" <normalperson@yhbt.net> said: >> > I''m stumped :< >> >> I was afraid you''d say that :(. > > Actually, another potential issue is DNS lookups timing out. But they > shouldn''t take *that* long... > >> > Do you have any background threads running that could be hanging the >> > workers? This is Ruby 1.8, after all, so there''s more likely to be >> > some blocking call hanging the entire process. AFAIK, some monitoring >> > software runs a background thread in the unicorn worker and maybe the >> > OpenSSL extension doesn''t work as well if it encountered network >> > problems under Ruby 1.8 >> >> We don''t explicitly create any threads in our rails code. We do >> communicate with backgroundrb worker processes, although, none of the >> strangeness today involved any routes that would hit backgroundrb >> workers. > > I proactively audit every piece of code (including external > libraries/gems) loaded by an app for potentially blocking calls (hits to > the filesystem, socket calls w/o timeout/blocking). I use strace to > help me find that sometimes... > >> Is there any instrumentation that I could add that might help >> debugging in the future? ($request_time and $upstream_response_time >> are now in my nginx logs.) We have noticed these "unexplainable >> timeouts" before, but typically for a single worker. If there''s some >> debugging that could be added I might be able to track it down during >> these one-off events. > > As an experiment, can you replay traffic a few minutes leading up to and > including that 7m period in a test setup with only one straced worker?I''ve replayed this in a development environment without triggering the issue. Lack of POST payloads and slight differences between the two environments, make this difficult to test.> Run "strace -T -f -o $FILE -p $PID_OF_WORKER" and see if there''s any > unexpected/surprising dependencies (connect() to unrecognized addresses, > open() to networked filesystems, fcntl locks, etc...). > > You can play around with some other strace options (-v/-s SIZE/-e filters)I''ll do my best to look at the output for anything out of the ordinary. Unfortunately, strace is a new tool for me and is a bit out of my wheelhouse. _______________________________________________ Unicorn mailing list - mongrel-unicorn@rubyforge.org http://rubyforge.org/mailman/listinfo/mongrel-unicorn Do not quote signatures (like this one) or top post when replying
fwiw, we experience this too, irregularly: about once every week or
two, for a few seconds at a time, and sometimes up to a minute or two.
For no apparent reason indeed, and at random times.

My suspicion so far has always been that this is due to bad (wifi)
networks on the client side. What happens when a request is handled
successfully by rails, but for whatever reason the last bit of the
response can't be delivered because the receiving end is not consuming
it in a timely fashion? If I'm not mistaken, I think rails logs a 200
success message before the response is actually sent back?

Is a response sent back by rails / unicorn fully buffered by nginx? Or
is a unicorn worker blocked until the response is confirmed to have
been successfully received by the client?

Aside: how do you protect your unicorn workers against a barrage of
client requests that eat the response veeeeeery sloooooooowly?

--
JS
Jimmy Soho <jimmy.soho@gmail.com> wrote:
> fwiw, we experience this too, irregularly: about once every week or
> two, for a few seconds at a time, and sometimes up to a minute or
> two. For no apparent reason indeed, and at random times.

Wait, based on what you say below, is this happening to unicorn or
nginx?

> My suspicion so far has always been that this is due to bad (wifi)
> networks on the client side. What happens when a request is handled
> successfully by rails, but for whatever reason the last bit of the
> response can't be delivered because the receiving end is not
> consuming it in a timely fashion?

So nginx is complaining about disconnected clients, not unicorn,
correct?

> If I'm not mistaken, I think rails logs a 200 success message before
> the response is actually sent back?

Correct, rails/unicorn won't know if the client disconnected until
nginx breaks the connection to unicorn.

> Is a response sent back by rails / unicorn fully buffered by nginx?
> Or is a unicorn worker blocked until the response is confirmed to
> have been successfully received by the client?

nginx fully buffers both requests and responses by default.
Which leads us to...

> Aside: how do you protect your unicorn workers against a barrage of
> client requests that eat the response veeeeeery sloooooooowly?

That's the whole point of using nginx with unicorn. nginx shields
unicorn from slow clients.
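For reference, this is the default behaviour; you don't have to turn
anything on. A minimal sketch of the relevant proxy settings (the
upstream name is just an example, and the values shown are the nginx
defaults):

  location / {
    proxy_pass http://unicorn_app;   # example upstream name
    proxy_buffering on;              # default: nginx buffers the response from unicorn
    proxy_connect_timeout 60s;       # default: governs connecting to unicorn
    proxy_read_timeout 60s;          # default: governs waiting on the response header
  }

Request bodies are likewise buffered by nginx before the request is
handed to unicorn, so a slow uploader shouldn't tie up a worker
either.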
"Eric Wong" <normalperson@yhbt.net> said:> Jimmy Soho <jimmy.soho@gmail.com> wrote: >> fwiw, we experience this too irregularly, about once a week or 2 weeks >> for a few seconds, and sometimes up to a minute or 2 minutes long. for >> no apparent reasons indeed. and at random times. > > Wait, based on what you say below, is this happening to unicorn or > nginx? > >> my suspicion so far has always been that this is due to bad (wifi) >> networks of clients. what happens when a request is handled >> successfully by rails, but for whatever reason the last bit of the >> response can''t be delivered because the receiving end is not consuming >> in a timely fashion? > > So nginx is complaining about disconnected clients, not unicorn, > correct? > >> If i''m not mistaken i think that rails logs a 200 success message >> before the response is actually send back? > > Correct, rails/unicorn won''t know if the client disconnected until > nginx breaks the connection to unicorn. > >> Is a response send back by rails / unicorn fully buffered by nginx? Or >> is a unicorn worker blocked until the response is confirmed to have >> been successfully received by the client? > > nginx fully buffers both requests and responses by default. > Which leads us to... > >> Aside: how do you protect your unicorn workers against a barrage of >> client requests that eat the response veeeeeery sloooooooowly? > > That''s the whole point of using nginx with unicorn. nginx shields > unicorn from slow clients.Is it possible that the behavior could be related to nginx? Nginx hanging onto the connection too long from the worker, so that the worker never makes it back into the pool and eventually get''s killed for not checking in? _______________________________________________ Unicorn mailing list - mongrel-unicorn@rubyforge.org http://rubyforge.org/mailman/listinfo/mongrel-unicorn Do not quote signatures (like this one) or top post when replying
nick@auger.net wrote:> "Eric Wong" <normalperson@yhbt.net> said: > > Jimmy Soho <jimmy.soho@gmail.com> wrote: > >> Is a response send back by rails / unicorn fully buffered by nginx? Or > >> is a unicorn worker blocked until the response is confirmed to have > >> been successfully received by the client? > > > > nginx fully buffers both requests and responses by default. > > Which leads us to... > > > >> Aside: how do you protect your unicorn workers against a barrage of > >> client requests that eat the response veeeeeery sloooooooowly? > > > > That''s the whole point of using nginx with unicorn. nginx shields > > unicorn from slow clients.> Is it possible that the behavior could be related to nginx? Nginx > hanging onto the connection too long from the worker, so that the > worker never makes it back into the pool and eventually get''s killed > for not checking in?Unlikely. That''d be a serious bug in nginx since nginx was originally designed to do this buffering since the earliest versions. Which version of nginx are you running? Can you be sure "proxy_buffering off" is _not_ set anywhere? _______________________________________________ Unicorn mailing list - mongrel-unicorn@rubyforge.org http://rubyforge.org/mailman/listinfo/mongrel-unicorn Do not quote signatures (like this one) or top post when replying