In May I had a problem with mongrels suddenly consuming huge cpu resources for a minute or two and then returning to normal (load average spikes up to 3.8 and then back down to a regular 0.2 over the course of 5 minutes, then again 1/2 hour later, or 4 hours later, no predictable rhythm).

I posted to the Litespeed forums because I thought the problem was there but didn't get far. A week later we migrated hosting companies and the problem was gone. Now it's returned. We make a lot of changes, but I've gone over the repo for the last few weeks and can't see anything structural that should affect it. It only happens with our main front end app (the other two are fine), but happens at all times of day (/night) so doesn't seem triggered by a heavy load. Basically a mongrel gets stuck on one or two cached files for a few minutes (but still functions fine for other requests, I can ping specific rails pages on all mongrels during this period).

strace -e read,write,close produces this repeatedly the whole time (short excerpt of 1000s of lines):

close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_rowde_wiltshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image/p"..., 16384) = 471
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_cove_south_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, imag"..., 16384) = 474
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_rowde_wiltshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image/p"..., 16384) = 471
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_cove_south_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, imag"..., 16384) = 474
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_rowde_wiltshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image/p"..., 16384) = 471
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_cove_south_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, imag"..., 16384) = 474
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_rowde_wiltshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image/p"..., 16384) = 471
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_cove_south_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, imag"..., 16384) = 474
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)
read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473
close(5) = 0
close(5) = -1 EBADF (Bad file descriptor)

the file it's trying to get is page cached, and exists/is fine (can even go to the url while this is going on).

Could it still be a problem with Litespeed (actually requesting this file many times)? Litespeed's cpu usage does go up during this period, but stracing it doesn't give anything useful.

thanks for any tips/directions.

Zach
What system architecture are you using?

Evan

On 10/23/07, Zachary Powell <zach at plugthegap.co.uk> wrote:
> In May I had a problem with mongrels suddenly consuming huge cpu resources
> for a minute or two and then returning to normal (load average spikes up to
> 3.8 and then back down to a regular 0.2 over the course of 5 minutes, then
> again 1/2 hour later, or 4 hours later, no predictable rhythm).
> <snip>

--
Evan Weaver
Cloudburst, LLC
On 10/23/07, Zachary Powell <zach at plugthegap.co.uk> wrote:
> In May I had a problem with mongrels suddenly consuming huge cpu resources
> for a minute or two and then returning to normal (load average spikes up to
> 3.8 and then back down to a regular 0.2 over the course of 5 minutes, then
> again 1/2 hour later, or 4 hours later, no predictable rhythm).
>
> strace -e read,write,close produces this repeatedly the whole time (short
> excerpt of 1000s of lines):
>
> close(5) = -1 EBADF (Bad file descriptor)
> read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk
> HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384)
> = 473
> <snip>

It appears that the file is either being prematurely closed, doesn't exist, or some other file-related error is occurring. Please give us the details of your setup and architecture.

The first thing I would check is your open file limit. If this process is opening more files than the user it is running as is allowed to open, then the OS will begin denying the process files it tries to open. This can be caused either by bad code not closing files, or perhaps your app simply has to open that many files due to application design and load. Either way, check the limit and compare it to how many files the process has open at the time you see this occurring.

I hope that this helps,

~Wayne
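To put numbers on that suggestion, the descriptor count and the limit can be compared directly from /proc. A minimal Ruby sketch of the check (the PID argument and the 80% warning threshold are placeholders, not anything from this setup):

# Rough check: how many descriptors a process has open vs. the shell's
# soft open-file limit. Pass a mongrel PID on the command line.
pid   = ARGV[0] || Process.pid
open  = Dir.entries("/proc/#{pid}/fd").size - 2   # subtract "." and ".."
limit = `sh -c 'ulimit -n'`.to_i                  # soft limit for the invoking shell
puts "pid #{pid}: #{open} fds open, limit #{limit}"
warn "approaching the fd limit" if open > limit * 0.8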
Are you using a web server in front of your mongrels? It should be picking up the page cached file before even considering handing the request to a mongrel.

Cheers

Dave

On 24/10/2007, at 11:30 AM, Zachary Powell wrote:
> close(5) = 0
> close(5) = -1 EBADF (Bad file descriptor)
> read(5, "GET /flower_delivery/florists_in_covehithe_suffolk_england_uk
> HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image"..., 16384) = 473
> close(5) = 0
> close(5) = -1 EBADF (Bad file descriptor)
>
> the file it's trying to get is page cached, and exists/is fine (can
> even go to the url while this is going on).
Hi Guys,

I'm using the latest Litespeed 3.2.4 (just upgraded), Mongrel 1.0.1, and Ruby 1.8.6, running Red Hat Enterprise Linux ES 4. We have a

Web/Mysql/Mail server:
# RAID Configuration: RAID 1 (73GBx2)
# HP Memory: 2 GB HP RAM
# HP DL385 G2 Processor: Dual Socket Dual Core Opteron 2214 2.2 GHz
(runs between 0.10 - 0.20, mostly MySql, but spikes up when the issue occurs, with litespeed taking ~30% cpu)

and an App Server:
# RAID Configuration: RAID 1 (146GBx2)
# HP Memory: 4 GB HP RAM
# HP DL385 G2 Processor: Dual Socket Dual Core Opteron 2214 2.2 GHz
(usually 0.20 - 0.60 with legitimate spikes up to 1.8 for backend processes. Spikes up to 2-4 when it happens, depending on how many mongrels get the problem (sometimes 2))

And these are the mongrels running:

MONGREL   CPU  MEM  VIR  RES  DATE   TIME    PID
8010:hq   3.2  3.4  145  138  Oct23  43:13   20409
8011:hq   0.6  3.0  132  125  Oct23  8:15    20412
8012:hq   0.1  1.8  81   74   Oct23  1:28    20415
8015:dhq  0.0  1.0  50   44   02:41  0:08    4775
8016:dhq  0.0  0.7  34   30   02:41  0:01    4778
8017:dhq  0.0  0.7  36   30   02:41  0:01    4781
8020:af   9.0  3.3  143  137  Oct23  114:41  26600
8021:af   5.6  2.0  90   84   Oct23  71:56   26607
8022:af   2.4  1.8  80   74   Oct23  30:37   26578
8025:daf  0.0  1.0  49   42   02:41  0:04    4842
8026:daf  0.0  0.7  34   30   02:41  0:02    4845
8027:daf  0.0  0.7  36   30   02:41  0:02    4848
8030:pr   0.1  1.5  67   61   Oct23  1:50    16528
8031:pr   0.0  0.9  47   40   Oct23  0:17    16532
8032:pr   0.0  0.9  44   38   Oct23  0:13    16536
8035:dpr  0.2  0.7  36   30   12:30  0:02    22335
8036:dpr  0.2  0.7  35   30   12:30  0:02    22338
8037:dpr  0.2  0.7  35   30   12:30  0:02    22341

(The ones starting with d are in dev mode; I will try turning them off tonight. I hadn't considered this a spill-over issue, but it happened just now and turning them off didn't ease it. We had a lot less running when the problem was occurring before, but also a 1-box set-up.)

It's the 8020-8022 ones that have trouble. It is indeed picking up the page cache, and while it's happening I can go to one of those pages in question, or cat it in SSH, with no problems. I've monitored the rails log while it was happening and haven't seen any EBADF spilling over. Though it's conceivable that a spike of hits from a Google crawl could cause a problem, I could try siege/ab tonight.

Not familiar with checking file limits, but this is what I get from googling a command:

cat /proc/sys/fs/file-nr:
2920 0 372880

ulimit -a:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 1024
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 77823
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Will post the ls -l /proc/{mongrel.pid}/fd/ next time it happens, and see how close it's getting to 1024. (Though again it seems strange that it could be having a max-open problem when the other non-cached pages on that pid, which are definitely opening files, work fine.)

Thanks again,

Zach

On 10/23/07, Dave Cheney <dave at cheney.net> wrote:
> Are you using a web server in front of your mongrels? It should be
> picking up the page cached file before even considering handing the request
> to a mongrel.
> <snip>
Hi, short follow up. I did

lsof -p 27787 | grep mongrel -c
=> 64 files open

while it was spiking, so it doesn't look like a max file issue. Looking through the list, the only thing of note was

mongrel_r 27787 rails 5w unknown /proc/27787/fd/5 (readlink: No such file or directory)

5 was likely the fd of the file getting EBADF (usually 4-6). I didn't catch which file it was before the problem ended (running lsof was very slow, took 20 secs while the problem was occurring), but previously when I've checked it has always been there and accessible from the web. Also, the dir in question is full of files, so it hasn't been swept recently (couldn't have been cache-expiry while reading or anything weird like that).

Zach

On 10/24/07, Zachary Powell <zach at plugthegap.co.uk> wrote:
> Hi Guys,
>
> I'm using the latest Litespeed 3.2.4 (just upgraded), Mongrel 1.0.1, and Ruby
> 1.8.6, running Red Hat Enterprise Linux ES 4.
> <snip>
Assuming you have access to a root shell on your hosting service, then when you are watching a slow system, you can start up a root shell and renice that shell to a value of -1. This will give the shell a higher priority and your command will be much quicker. For example:

bash #: su
Password:
root #: ps
  PID TTY          TIME CMD
 1443 pts/2    00:00:00 bash
 1447 pts/2    00:00:00 ps
root #: renice -1 1443
root #: suspend
bash #:

Use the suspend to leave the root job around and then foreground it when you have the problem and use it.

If you don't have a root shell, but have sudo, then use

sudo nice -1 ps of -p 27787

etc.

- Jim Hogue

----- Original Message -----
From: "Zachary Powell" <zach at plugthegap.co.uk>
Cc: <mongrel-users at rubyforge.org>
Sent: Wednesday, October 24, 2007 9:14 AM
Subject: Re: [Mongrel] random cpu spikes, EBADF errors

> Hi, short follow up. I did
>
> lsof -p 27787 | grep mongrel -c
> => 64 files open
>
> while it was spiking, so it doesn't look like a max file issue.
> <snip>
Whoops, that nice command needs a --1, as in

sudo nice --1 ps of -p 27787

- Jim

----- Original Message -----
From: "Jim Hogue" <jjhogue at sbcglobal.net>
To: <zach at plugthegap.co.uk>; <mongrel-users at rubyforge.org>
Sent: Wednesday, October 24, 2007 11:07 AM
Subject: Re: [Mongrel] random cpu spikes, EBADF errors

> Assuming you have access to a root shell on your hosting service, then
> when you are watching a slow system, you can start up a root shell and
> renice that shell to a value of -1. This will give the shell a higher
> priority and your command will be much quicker.
> <snip>
Hi All,

Follow-up to the CPU/EBADF issue I was having with lsws:

http://www.litespeedtech.com/support/forum/showthread.php?t=1012&goto=newpost

Here is the message that has just been posted:

***************
The problem is on mongrel side. As shown in the strace output, file handle 5 is the reverse proxy connection from LSWS to mongrel. Mongrel read the request, then it closed the connection immediately without sending back anything, then tried to close it again with result EBADF, because the file descriptor had been closed already.

When mongrel was working, it should send the reply back to LSWS before closing the socket.

The root cause of the problem is on Mongrel side; however, LSWS should fail the request after a few retries. We will implement that in our 3.3 release.
***************

Zach

On 10/24/07, Zachary Powell <zach at plugthegap.co.uk> wrote:
> Hi, short follow up. I did
>
> lsof -p 27787 | grep mongrel -c
> => 64 files open
>
> while it was spiking, so it doesn't look like a max file issue.
> <snip>
> When mongrel was working, it should send the reply back to LSWS
> before closing the socket.

There's a string prepared for the purpose in mongrel.rb:

ERROR_503_RESPONSE="HTTP/1.1 503 Service Unavailable\r\n\r\nBUSY".freeze

It's a one-liner to send that to the socket before calling close.

Zachary Powell wrote:
> Hi All,
>
> Follow-up to the CPU/EBADF issue I was having with lsws:
>
> http://www.litespeedtech.com/support/forum/showthread.php?t=1012&goto=newpost
>
> Here is the message that has just been posted:
> ***************
> The problem is on mongrel side. As shown in the strace output, file
> handle 5 is the reverse proxy connection from LSWS to mongrel. Mongrel
> read the request, then it closed the connection immediately without
> sending back anything, then tried to close it again with result EBADF,
> because the file descriptor had been closed already.
> <snip>
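In Mongrel 1.0.x that one-liner would land in the overload branch of HttpServer#run, roughly as sketched below. The surrounding lines are paraphrased from memory rather than copied from the 1.0.1 source, and it assumes the constant is reachable as Mongrel::Const::ERROR_503_RESPONSE:

# Sketch only: write the prepared 503 before dropping the connection,
# so the front end gets an answer instead of a silent close.
if worker_list.length >= @num_processors
  STDERR.puts "Server overloaded with #{worker_list.length} processors (#{@num_processors} max). Dropping connection."
  client.write(Mongrel::Const::ERROR_503_RESPONSE) rescue nil   # the added line
  client.close rescue nil
  reap_dead_workers("max processors")
else
  # normal path: hand the client off to a worker thread
end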
Perhaps you are exhausting the number of worker threads in the queue available to mongrel (default 900ish I think). If your cached files are very big, maybe they aren't being served quickly enough by the DirHandler and your queue becomes clogged.

Should mongrel definitely send 503 after this state, or not? I think there was some debate recently about the same issue and the resolution was inconclusive.

Does Litespeed support x-sendfile? Maybe the DirHandler should be updated to take advantage of that.

Evan

On Oct 29, 2007 4:27 PM, Robert Mela <rob at robmela.com> wrote:
> > When mongrel was working, it should send the reply back to LSWS
> > before closing the socket.
>
> There's a string prepared for the purpose in mongrel.rb:
>
> ERROR_503_RESPONSE="HTTP/1.1 503 Service Unavailable\r\n\r\nBUSY".freeze
>
> It's a one-liner to send that to the socket before calling close.
> <snip>

--
Evan Weaver
Cloudburst, LLC
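As a rough illustration of the x-sendfile idea, a handler can resolve the path and hand it back to the front end in a header instead of streaming the body from Ruby. This is only a sketch against Mongrel's public HttpHandler API; SendfileHandler and the docroot argument are made-up names, and the front-end server must have its X-Sendfile support enabled for the header to do anything:

require 'mongrel'

# Illustrative only: map the request path onto a file under docroot and
# let the front-end server deliver it via the X-Sendfile header.
class SendfileHandler < Mongrel::HttpHandler
  def initialize(docroot)
    @docroot = File.expand_path(docroot)
  end

  def process(request, response)
    path = File.expand_path(File.join(@docroot, request.params["PATH_INFO"].to_s))
    return unless path.index(@docroot) == 0 && File.file?(path)   # stay inside docroot

    response.start(200) do |head, out|
      head["X-Sendfile"] = path   # front end streams the file; no body written here
    end
  end
end

Wiring it in would follow the usual -S config-script pattern, e.g. uri "/flower_delivery", :handler => SendfileHandler.new("public") (paths here are illustrative).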
I disagree. As soon as you start putting code specific to other web servers in the core Mongrel, you'll either have to start adding it for other servers as well (thus bloating the code!), or take a stance on which other servers to support. I think this goes completely against Zed's original vision.

Zed has written several times about how easy it is to create your own handler (I believe as a Gem plugin), and to configure Mongrel to use it. Plus, if you bundle it as a plugin, you can distribute it separately from the Mongrel core, so those of us who don't need it won't have to load code for a program we choose not to run.

=
Will Green

Evan Weaver wrote:
> Perhaps you are exhausting the number of worker threads in the queue
> available to mongrel (default 900ish I think). If your cached files
> are very big, maybe they aren't being served quickly enough by the
> DirHandler and your queue becomes clogged.
>
> Does Litespeed support x-sendfile? Maybe the DirHandler should be
> updated to take advantage of that.
> <snip>
The Camping handler already has in-Mongrel support for X-SENDFILE. It would make sense to add a configurable option to the DirHandler and Rails handler, as well as the Camping handler, to actually pass the header along to webservers that support it, so it does what it's supposed to instead of faking it.

Also, X-SENDFILE was actually invented by lighttpd, so it's not a super-proprietary thing if you're concerned about that. On the other hand, if the webserver doesn't have a superset of the permissions the mongrel does, it could blow up, and if it does, it could possibly be a vector for security breaches.

However, the X-SENDFILE code in the Camping handler should probably be moved into core somewhere instead of the handler; it's kind of a weird place to have it. Just saying.

I definitely understand your concerns. Really the issue is whether we should return any response when closing a connection due to resource overloading, I think.

Evan

On Oct 29, 2007 7:18 PM, Will Green <will at hotgazpacho.com> wrote:
> I disagree. As soon as you start putting code specific to other web servers in the core Mongrel,
> you'll either have to start adding it for other servers as well (thus bloating the code!), or take a
> stance on which other servers to support. I think this goes completely against Zed's original vision.
> <snip>

--
Evan Weaver
Cloudburst, LLC
Surely it's preferable to just delay the accept() until there's a thread to assign it to? That way the client sees a slow connection-establishment and can draw their own conclusions, including deciding how long to wait or whether to retry.

Clifford Heath, Data Constellation.

> Evan Weaver wrote:
>> Perhaps you are exhausting the number of worker threads in the queue
>> available to mongrel (default 900ish I think). If your cached files
>> are very big, maybe they aren't being served quickly enough by the
>> DirHandler and your queue becomes clogged.
>>
>> Should mongrel definitely send 503 after this state, or not? I think
>> there was some debate recently about the same issue and the resolution
>> was inconclusive.
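In pseudo-Ruby the suggestion amounts to applying the backpressure before accept() rather than after it. Every name below (workers, num_processors, server_socket, handle) is illustrative, not Mongrel's actual accept loop:

# Illustrative accept loop: leave excess clients in the kernel's listen
# backlog until a worker slot frees up, instead of accepting the
# connection and immediately closing it.
loop do
  sleep 0.05 while workers.list.length >= num_processors   # wait for a free worker slot
  client = server_socket.accept                            # the client just sees a slow connect
  workers.add(Thread.new(client) { |sock| handle(sock) })
end

Clients that wait too long in the backlog see a slow or timed-out connect, which an upstream proxy or load balancer can treat as a signal to try another backend.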
Evan, I hear you! I know you have the best interests of Mongrel in mind.

X-SendFile is just a header, right? If so, yeah, it could be moved to core.

If we're talking about the Ruby sendfile, then I think that should NOT be in core. I recall many people having issues (i.e. it doesn't work) with that.

Regarding the closing of the socket without notice, is that something that Ruby does, or is it that a resource limit was reached and this handle was chosen by the OS to be closed? If the former, an HTTP 503 response is appropriate. If the latter, it seems to me that another Mongrel should be employed in a cluster configuration, or the app code examined to see if it might be the source of the problem.

=
Will Green
It's a Mongrel-configured limit to avoid queuing an impossibly long number of requests in an overloaded situation. So we can return whatever we want.

I think the issue might be: if you can only handle 500 requests p/s and you are getting 600, then if Mongrel closes the connection, at least those 500 will get served, but if Mongrel returns 503, the web server will say "hey, error" and try the next mongrel, which won't help clear the request queue. The requests will still queue, just at a higher level, and no one will end up getting a request served in a sane amount of time.

Evan

On Oct 29, 2007 7:55 PM, Will Green <will at hotgazpacho.com> wrote:
> Evan, I hear you! I know you have the best interests of Mongrel in mind.
>
> X-SendFile is just a header, right? If so, yeah, it could be moved to core.
> <snip>

--
Evan Weaver
Cloudburst, LLC
On 10/29/07, Will Green <will at hotgazpacho.com> wrote:
> Evan, I hear you! I know you have the best interests of Mongrel in mind.
>
> X-SendFile is just a header, right? If so, yeah, it could be moved to core.

Yeah, he was talking about X-SendFile.

> If we're talking about the Ruby sendfile, then I think that should NOT be in core. I recall many people
> having issues (i.e. it doesn't work) with that.

Also, send_file is broken on Windows too; besides, it eats all your memory and hangs your process.

Nice, don't you think?

--
Luis Lavena
Multimedia systems
-
Leaders are made, they are not born. They are made by hard effort,
which is the price which all of us must pay to achieve any goal that
is worthwhile.
Vince Lombardi
Note that there is no guarantee that this is actually the issue discussed with the above configuration. It's just an issue that has been raised before that might be related.

Evan

On Oct 29, 2007 8:02 PM, Evan Weaver <evan at cloudbur.st> wrote:
> It's a Mongrel-configured limit to avoid queuing an impossibly long
> number of requests in an overloaded situation. So we can return
> whatever we want.
> <snip>

--
Evan Weaver
Cloudburst, LLC
See, if we used real X-Sendfile, the webserver maintainers would have to worry about Windows problems, not Mongrel ;). For now we'll just leave the X-Sendfile behavior alone.

Evan

On Oct 29, 2007 8:03 PM, Luis Lavena <luislavena at gmail.com> wrote:
> Also, send_file is broken on Windows too; besides, it eats all your
> memory and hangs your process.
>
> Nice, don't you think?
> <snip>

--
Evan Weaver
Cloudburst, LLC
Already covered in this thread:
http://rubyforge.org/pipermail/mongrel-users/2007-October/004132.html

Delaying the accept() would be more helpful for load balancers which,
after a timeout for connect, cycle to another backend in the pool.
Failing that, a 503 would be reasonable, and it offers a hint to users as
to what's really happening. The open/close does not.

According to the HTTP/1.1 spec, the 503 and refuse-to-accept are both correct
( http://www.w3.org/Protocols/HTTP/1.1/rfc2616bis/draft-lafon-rfc2616bis-03.txt ):

   10.5.4. 503 Service Unavailable

   The server is currently unable to handle the request due to a
   temporary overloading or maintenance of the server. The implication
   is that this is a temporary condition which will be alleviated after
   some delay. If known, the length of the delay MAY be indicated in a
   Retry-After header. If no Retry-After is given, the client SHOULD
   handle the response as it would for a 500 response.

      Note: The existence of the 503 status code does not imply that a
      server must use it when becoming overloaded. Some servers may
      wish to simply refuse the connection.

Anyhow, I suggested one means for doing that in a previous thread
(entitled "num_threads or accept/close" or something like that).

Clifford Heath wrote:
> Surely it's preferable to just delay the accept() until there's a
> thread to assign it to? That way the client sees a slow
> connection-establishment and can draw their own conclusions, including
> deciding how long to wait or whether to retry.
>
> Clifford Heath, Data Constellation.
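To make Clifford's alternative concrete: gating accept() on worker
availability keeps the overflow in the kernel's listen backlog (and in the
clients' connect timeouts) rather than inside the Ruby process. This is
only a sketch of the idea, not anything Mongrel ships; the pool size and
the handle_request method are hypothetical.

    require 'socket'

    NUM_WORKERS = 10
    slots = SizedQueue.new(NUM_WORKERS)      # counting semaphore of free workers
    NUM_WORKERS.times { slots.push(:free) }

    server = TCPServer.new('0.0.0.0', 3000)

    loop do
      slots.pop                      # block here while every worker is busy,
      client = server.accept         # so accept() only happens when one is free
      Thread.new(client) do |c|
        begin
          handle_request(c)          # hypothetical request handler
        ensure
          c.close rescue nil
          slots.push(:free)          # return the slot to the pool
        end
      end
    end

The trade-off Zed raises later in the thread is that while connections sit
unaccepted, the load balancer is left holding sockets it cannot tell apart
from a healthy-but-slow backend.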
I think currently it accepts the connection and then immediately closes
it, which is not consistent with the spec.

Evan

On Oct 29, 2007 9:20 PM, Robert Mela <rob at robmela.com> wrote:
> Failing that, a 503 would be reasonable, and it offers a hint to
> users as to what's really happening. The open/close does not.

--
Evan Weaver
Cloudburst, LLC
Hi Evan,

You are doing a really great job supporting so many people on this list -
thank you! I'm learning a lot just listening.

I've been considering that what you're asking below (and what Robert Mela
has been pushing everyone on) is essentially identifying that there might
need to be at least two modes to approaching Mongrel queuing:

1) Mongrel queues all requests (current model)
2) Load balancer / webserver (or even IP stack) queues most requests

I think Mongrel right now is designed solely for the case where Mongrel is
supposed to queue all requests? Robert Mela seems to want an environment
where Mongrel queues only some or no requests, because he seems to have a
way to get Apache + mod_proxy_magic_bullet to queue and re-try failed
requests from mongrels.

I wonder if it makes sense to create a mode for Mongrel where it queues
only a few (or no) requests and the load balancer or webserver (or even IP
stack) is designed to queue the majority of backlogged requests? Could this
be a user-configurable setting (:queue_length => 2.requests or whatever)?
In the event that the queue length grows bigger than this limit, Mongrel
responds with a 503. If the calling agent understands 503, it would be able
to try other mongrels in the cluster until it finds one that is free. If
they are all busy it would just keep knocking on doors until one frees up.

This approach could make things worse in extreme load environments because
now you have backed-up mongrels and a pile of re-requests hammering on the
door to all the mongrels as well. But that's a worst-case scenario (e.g.
slashdotting) that is going to break SOMETHING SOMEWHERE anyway. So why not
have it melt down at the interface between the webservers and the mongrel
cluster instead of inside the mongrels (what's the difference)?

The benefit of this alternate mode of operation would be that free mongrels
get called more often and overloaded mongrels get skipped more often, which
creates a much smoother user experience on the front end, generally
speaking (this approach improves performance of moderately loaded websites
at the expense of punishing heavily loaded ones - who should probably add
more mongrels/hardware anyway).

The only changes to Mongrel code would be to allow a configurable queue
length on a per-mongrel basis (maybe already in there?) and a setting to
cause Mongrels to accept and return 503 instead of accepting and closing
the connection? Defaults would remain the same as they are now.

Would such a dual mode of operation for mongrels make sense for some users
or am I just completely barking up the wrong tree here? Apologies if this
is a distraction from the real issue you are discussing.

Best,
Steve

At 06:27 PM 10/29/2007, you wrote:
>Date: Mon, 29 Oct 2007 20:02:32 -0400
>From: "Evan Weaver" <evan at cloudbur.st>
>Subject: Re: [Mongrel] random cpu spikes, EBADF errors
>To: mongrel-users at rubyforge.org
>
>It's a Mongrel-configured limit to avoid queuing an impossibly long
>number of requests in an overloaded situation. So we can return
>whatever we want.
>
>I think the issue might be, if you can only handle 500 requests p/s,
>and you are getting 600, if Mongrel closes the connection, at least
>those 500 will get served, but if Mongrel returns 503, the web server
>will say "hey, error" and try on the next mongrel, which won't help
>clear the request queue. The requests will still queue, just at a
>higher level, and no one will end up getting a request served in a sane
>amount of time.
>
>Evan
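A rough illustration of the dual mode Steve is proposing above. The
queue_length limit and the 503 reply are hypothetical, not existing
Mongrel configuration, and handle_request stands in for the real dispatch;
this is a sketch of the proposal, not Mongrel's actual accept loop.

    require 'socket'

    BUSY_RESPONSE = "HTTP/1.1 503 Service Unavailable\r\nContent-Length: 0\r\n\r\n".freeze

    def run(server, queue_length = 2, reply_with_503 = true)
      workers = ThreadGroup.new

      loop do
        client = server.accept
        if workers.list.length >= queue_length
          # Over the configured limit: either advertise "busy" so the web
          # server can try another mongrel (proposed mode), or just drop
          # the connection (the current behavior being discussed).
          client.write(BUSY_RESPONSE) if reply_with_503
          client.close rescue nil
        else
          workers.add(Thread.new(client) { |c| handle_request(c) })  # hypothetical handler
        end
      end
    end

Whether the 503 actually helps then depends on what the front end does
with it, which is the point Evan and Zed debate elsewhere in the thread.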
On Mon, 29 Oct 2007 16:09:17 -0400
"Zachary Powell" <zach at plugthegap.co.uk> wrote:

> Hi All,
> Follow up to the CPU/EBADF issue I was having with lsws:
>
> http://www.litespeedtech.com/support/forum/showthread.php?t=1012&goto=newpost
>
> Here is the message that has just been posted:
> ***************
> The problem is on mongrel side. As shown in the strace output, file handle 5
> is the reverse proxy connection from LSWS to mongrel. Mongrel read the
> request, then it closed the connection immediately without sending back
> anything, then try to close it again with result EBADF, because the file
> descriptor has been closed already.

Take a look in the mongrel.log with debugging on, as there might be a
complaint about LSWS's interpretation of the HTTP protocol which is causing
Mongrel to close the connection due to a malformed request.

Normally, the only time that Mongrel will abort a connection is when the
client (LSWS in this case) sends a malformed request according to the HTTP
grammar. When Mongrel reports what caused the close it tells you the full
request that was bad. The error is usually BAD CLIENT.

Also go use ethereal to get a packet trace of the traffic between the two
servers to see what is being sent. You'll probably find that LSWS is doing
something that no other web server does, or at least get some clue as to
what is making Mongrel barf. One thing to watch for is that some web
servers acting as a proxy don't honor the Connection: close header on
responses and try to keep the socket forced open, which also violates the
RFC.

Finally, a stack trace of where the EBADF shows up would let the Mongrel
team just not close it if it's already closed (again). Ultimately Ruby
shouldn't be throwing these errnos as separate exceptions, since it means
having to compensate for every platform's interpretation of the sockets API
and what should be thrown when.

--
Zed A. Shaw
- Hate: http://savingtheinternetwithhate.com/
- Good: http://www.zedshaw.com/
- Evil: http://yearofevil.com/
On Mon, 29 Oct 2007 16:27:49 -0400 Robert Mela <rob at robmela.com> wrote:

> When mongrel was working, it should send the reply back to LSWS
> before closing the socket.
>
> There's a string prepared for the purpose in mongrel.rb:
>
> ERROR_503_RESPONSE="HTTP/1.1 503 Service Unavailable\r\n\r\nBUSY".freeze
>
> It's a one-liner to send that to the socket before calling close.

No, that's not the best way to do this. Think for a minute. Mongrel is
overloaded. It's having a hard time sending data. Now you want it to waste
more time sending data?

The general practice that works best is when a server is overloaded it
aborts connections it can't handle in order to get some free time to
service more requests. This way existing pending requests get some service,
and in a load balancing situation the server can move on to the next
available backend. The alternative of trying to handle all requests, even
with small responses, will mean that nobody gets service.

In reality, I bet that LSWS doesn't try to move on to the next backend when
the connection is aborted. If you think about this also, it means that when
LSWS is behaving as a proxy, and one of your backends goes down, then LSWS
won't adapt and will instead complain to the user. A properly functioning
proxy server that is behaving as a load balancer should try all servers
possible several times until it either gets a response or has to give up
because everything is down and/or it is overloaded as well.

--
Zed A. Shaw
- Hate: http://savingtheinternetwithhate.com/
- Good: http://www.zedshaw.com/
- Evil: http://yearofevil.com/
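For context, the one-liner Robert is referring to would look roughly like
the sketch below. Only the ERROR_503_RESPONSE constant comes from
mongrel.rb; the wrapper method is hypothetical, not Mongrel's shipped
behavior. Zed's point above is that under real overload even this one
extra write is work the server cannot spare.

    ERROR_503_RESPONSE = "HTTP/1.1 503 Service Unavailable\r\n\r\nBUSY".freeze

    def reject_when_busy(client)
      client.write(ERROR_503_RESPONSE)   # tell the proxy we are busy ...
    rescue Object
      # ... but never let a failed write on a dying socket raise out of here
    ensure
      client.close rescue nil            # the current behavior is just this line
    end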
On Mon, 29 Oct 2007 17:43:59 -0400 "Evan Weaver" <evan at cloudbur.st> wrote:

> Does Litespeed support x-sendfile? Maybe the DirHandler should be
> updated to take advantage of that.

Uh, wait, **Mongrel** is serving files? Nobody sees the problem with that?

This isn't a best practice at all, so first quit doing that and then see if
the problem persists. There's nothing Mongrel can do if you overload the
Ruby interpreter with simple file requests that LSWS could handle.

--
Zed A. Shaw
- Hate: http://savingtheinternetwithhate.com/
- Good: http://www.zedshaw.com/
- Evil: http://yearofevil.com/
On Tue, 30 Oct 2007 10:53:29 +1100 Clifford Heath <clifford.heath at gmail.com> wrote:

> Surely it's preferable to just delay the accept() until there's a
> thread to assign it to? That way the client sees a slow
> connection-establishment and can draw their own conclusions, including
> deciding how long to wait or whether to retry.

No, then the load balancer gets bound waiting for a response from a backend
that can't respond. This causes the load balancer to get a ton of dead
sockets. The LB should take the closed connection to mean "backend screwed,
try again" and move to the next one.

--
Zed A. Shaw
- Hate: http://savingtheinternetwithhate.com/
- Good: http://www.zedshaw.com/
- Evil: http://yearofevil.com/
On Mon, 29 Oct 2007 21:27:52 -0400 "Evan Weaver" <evan at cloudbur.st> wrote:

> I think currently it accepts the connection and then immediately
> closes it, which is not consistent with the spec.

It can't close a connection it hasn't accepted yet, and in practice you
find that your LB gets overloaded if you don't close it right away.

--
Zed A. Shaw
- Hate: http://savingtheinternetwithhate.com/
- Good: http://www.zedshaw.com/
- Evil: http://yearofevil.com/
Wow, triggered a whole discussion there (most of which was over my head, at
least at this hour). I've bumped it up to four mongrels to see if that
solves the problem (temporarily) and I'll turn the mongrel.log debug on and
see what I can find.

Thanks all,
Zach

On 10/30/07, Zed A. Shaw <zedshaw at zedshaw.com> wrote:
> It can't close a connection it hasn't accepted yet, and in practice you
> find that your LB gets overloaded if you don't close it right away.
On Tue, 30 Oct 2007 09:19:24 -0400 "Zachary Powell" <zach at plugthegap.co.uk> wrote:

> Wow, triggered a whole discussion there (most of which was over my head, at
> least at this hour). I've bumped it up to four mongrels to see if that
> solves the problem (temporarily) and I'll turn the mongrel.log debug on and
> see what I can find.

It is a common issue though with the HTTP RFC and what load balancers
should be doing. Effectively, the RFC describes a web server, proxy, and
client, but not really an LB of any kind. When people follow the RFC they
get some dumb behaviors from their web server that shouldn't apply to an
LB.

For example, many web servers will take the 503 responses from the backends
and then show them to the end user, which if you read the RFC is kind of
right but really wrong (it should try again). Others will take the RFC
literally and make a connection to a backend then hang out, which is wrong
in a practical sense since it means a mis-configured backend can cripple
the LB. Imagine if the LB had to wait for the "official" TCP timeout of
anywhere from 60 seconds to 200,000 days depending on the operating system.
(Yes pedants, that's exaggerated.)

There are also practical considerations when dealing with heavily loaded
network servers in general. I believe that the HTTP people got this one all
wrong in that they require a response, but logically if your server is
overloaded, you can't give a response.

So yes, you started a useful conversation since people are going to keep
hitting this over and over. The solution of course is the following:

** The HTTP RFC doesn't cover load balancers (or even proxy servers) in any
sufficient detail to be useful. **

That's the gist of it really.

Let us know what comes of your changes.

--
Zed A. Shaw
- Hate: http://savingtheinternetwithhate.com/
- Good: http://www.zedshaw.com/
- Evil: http://yearofevil.com/