Hi,

One of the two mongrel processes has died in the middle of the night four
times in the past 9 days, and I need help debugging this.

Each time the symptoms are the same:

* Each time I can restart the process via cap -a restart_app.
* Before the restart, there is nothing unusual in production.log or
  mongrel.log.
* During the restart, about 100 repetitions of an error message are
  generated in mongrel.log (see below).
* I followed the suggestions on mongrel.rubyforge.org/faq:

    lsof -i -P | grep CLOSE_WAIT
    99% CPU
    Memory Leak

  None of those show any problems.  Before the restart, top shows the
  largest memory processes are:

    top - 07:17:05 up 31 days, 20:38, 0 users, load average: 0.00, 0.00, 0.00
    Tasks:  46 total,  2 running, 44 sleeping,  0 stopped,  0 zombie
    Cpu(s): 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
    Mem:   262316k total,  239700k used,   22616k free,    3412k buffers
    Swap:       0k total,       0k used,       0k free,   88320k cached

      PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
    18001 root   16  0 45712  39m 2584 S  0.0 15.6  0:20.79 mongrel_rails
    18004 root   16  0 43624  38m 2524 S  0.0 15.0  0:25.93 mongrel_rails
     2632 mysql  16  0  109m  27m 3100 S  0.0 10.8  5:37.37 mysqld

  After the restart, memory usage rapidly approaches the above values while
  the application runs normally.
* Yesterday I updated production.rb with this entry:

    ActiveRecord::Base.verification_timeout = 14400

  and this did not fix the problem.

Note I am using 0.3.13.5 because I had problems with 0.3.13.4 not restarting
when my railsmachine system reboots.  However, I would gladly switch back to
0.3.13.4 if it solved this problem.

Usage is very low - only about 50-200 requests / hour.

One more factoid - it may be irrelevant, but it appears that shortly before
each of the four crashes, one or two pdf files were downloaded from my
public directory - which I believe apache serves directly, bypassing mongrel.

Suggestions?

Thanks

Robert Vogel

Error messages in mongrel.log generated during restart:

Tue Oct 31 07:26:30 PST 2006: Error calling Dispatcher.dispatch #<Sync_m::Err::UnknownLocker: Thread(#<Thread:0xb7544c58 aborting>) not locked.>
/usr/lib/ruby/1.8/sync.rb:57:in `Fail'
/usr/lib/ruby/1.8/sync.rb:63:in `Fail'
/usr/lib/ruby/1.8/sync.rb:183:in `sync_unlock'
/usr/lib/ruby/1.8/sync.rb:231:in `synchronize'
/usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/lib/mongrel/rails.rb:81:in `process'
/usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/lib/mongrel.rb:583:in `process_client'
/usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/lib/mongrel.rb:582:in `process_client'
/usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/lib/mongrel.rb:689:in `run'
/usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/lib/mongrel.rb:689:in `run'
/usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/lib/mongrel.rb:676:in `run'
/usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/lib/mongrel/configurator.rb:271:in `run'
/usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/lib/mongrel/configurator.rb:270:in `run'
/usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/bin/mongrel_rails:124:in `run'
/usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/lib/mongrel/command.rb:211:in `run'
/usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/bin/mongrel_rails:234
/usr/bin/mongrel_rails:18

Tue Oct 31 07:26:30 PST 2006: Error calling Dispatcher.dispatch #<Sync_m::Err::UnknownLocker: Thread(#<Thread:0xb7545540 aborting>) not locked.>
/usr/lib/ruby/1.8/sync.rb:57:in `Fail'
/usr/lib/ruby/1.8/sync.rb:63:in `Fail'
/usr/lib/ruby/1.8/sync.rb:183:in `sync_unlock'
etc.
The tail of mongrel.log after the restart is:

/usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/bin/mongrel_rails:234
/usr/bin/mongrel_rails:18
/var/www/apps/kd/current/config/../vendor/rails/activerecord/lib/active_record/transactions.rb:84:in `transaction': Transaction aborted (ActiveRecord::Transactions::TransactionError)
        from /usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/lib/mongrel/configurator.rb:293:in `join'
        from /usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/lib/mongrel/configurator.rb:293:in `join'
        from /usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/bin/mongrel_rails:133:in `run'
        from /usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/lib/mongrel/command.rb:211:in `run'
        from /usr/lib/ruby/gems/1.8/gems/mongrel-0.3.13.5/bin/mongrel_rails:234
        from /usr/bin/mongrel_rails:18
deadlock 0xb71a8518: sleep:-  - /usr/lib/ruby/1.8/thread.rb:100
deadlock 0xb7e3c748: sleep:- (main) - /usr/lib/ruby/1.8/thread.rb:100
** TERM signal received.
** Daemonized, any open files are closed.  Look at log/mongrel.8000.pid and log/mongrel.log for info.
** Starting Mongrel listening at 127.0.0.1:8000
** Starting Rails with production environment...
** Daemonized, any open files are closed.  Look at log/mongrel.8001.pid and log/mongrel.log for info.
** Starting Mongrel listening at 127.0.0.1:8001
** Starting Rails with production environment...
** Rails loaded.
** Loading any Rails specific GemPlugins
** Signals ready.  TERM => stop.  USR2 => restart.  INT => stop (no restart).
** Rails signals registered.  HUP => reload (without restart).  It might not work well.
** Mongrel available at 127.0.0.1:8000
** Writing PID file to log/mongrel.8000.pid
** Rails loaded.
** Loading any Rails specific GemPlugins
** Signals ready.  TERM => stop.  USR2 => restart.  INT => stop (no restart).
** Rails signals registered.  HUP => reload (without restart).  It might not work well.
** Mongrel available at 127.0.0.1:8001
** Writing PID file to log/mongrel.8001.pid
On Tue, 31 Oct 2006 12:48:02 -0700
Robert Vogel <robert at kitchendemocracy.org> wrote:

> Hi
>
> One of the two mongrel processes has died in the middle of the night
> four times in the past 9 days, and I need help debugging this.
>
> Each time the symptoms are the same:

Really quick, but upgrade to the pre-release and then tell me if you still
get these:

  sudo gem install mongrel --source=http://mongrel.rubyforge.org/releases

If it does not fix the problem (remember, it's random, so let it run in
production for a while), then turn on USR1 logging and watch for the rails
action that is blocking things:

  sudo killall -USR1 mongrel_rails

Otherwise, keep in mind that many, many people use Mongrel without blocking
problems, so you need to rule out anything non-standard you're using that
can cause problems.  RMagick, frequent DNS calls, and working with files or
shared resources are all main culprits.

--
Zed A. Shaw, MUDCRAP-CE Master Black Belt Sifu
http://www.zedshaw.com/
http://safari.oreilly.com/0321483502 -- The Mongrel Book
http://mongrel.rubyforge.org/
http://www.lingr.com/room/3yXhqKbfPy8 -- Come get help.
Hello -

I need help.  For the past 11 days, one of the two mongrel processes on my
railsmachine VPS has been crashing intermittently - it has crashed about 10
times, with increasing frequency in the past few days.  Unfortunately, after
many, many hours I still have not been able to reproduce this problem in a
controlled way - neither on my production railsmachine nor on my development
machine.  As far as I can tell, I have followed these suggestions from
Bradley and Zed and Zed's Mongrel book:

* lsof -i -P | grep CLOSE_WAIT shows nothing.
* 99% CPU is not associated with either mongrel process - the CPU is never
  above 5%, usually at 0%, both while the process is crashed and while they
  are running.
* A memory leak seems impossible.  %MEM for both processes is never above
  15%, both when crashed and when running.
* -B (debug) logging on my development machine shows that no object is
  steadily increasing its memory consumption - garbage collection seems to
  be working fine.
* -B logging on my development machine also shows no leaking files.  The
  number of open files is stable (at 6).
* Traffic is minuscule (< 100 requests / hour).
* Inserting ActiveRecord::Base.verification_timeout = 14400 in
  environments/production.rb had no effect.
* Upgrading to the pre-release mongrel had no effect:

    sudo gem install mongrel --source=http://mongrel.rubyforge.org/releases

My application is butt simple, and supported by oodles of unit, functional
and integration test code.  There is no transaction processing in the
application and no opportunity for a jammed request due to using shared
resources without proper locking.  The application does not use RMagick and
does not explicitly manipulate any files, and though I am not sure what you
mean by 'shared resources', I suspect I am not using any.  The only external
libraries (external to rails) are three gems which access geo-coding
services - but these were not in play when the processes crashed.

killall -USR1 mongrel_rails has been in effect now through the last two
crashes.  The rails action which held things up was different in both
cases - and is butt simple in both cases.  Here is the mongrel.log in the
vicinity of those two crashes:

Thu Nov 02 13:07:16 PST 2006: 0 threads sync_waiting for /, 1 still active in Mongrel.
Thu Nov 02 13:07:19 PST 2006: 0 threads sync_waiting for /login, 1 still active in Mongrel.
Thu Nov 02 13:07:27 PST 2006: 0 threads sync_waiting for /login, 1 still active in Mongrel.
Thu Nov 02 13:07:33 PST 2006: 0 threads sync_waiting for /admin/list_vote, 1 still active in Mongrel.
Thu Nov 02 13:07:42 PST 2006: 0 threads sync_waiting for /admin/mark_reviewed, 1 still active in Mongrel.
Thu Nov 02 13:08:17 PST 2006: 0 threads sync_waiting for /admin/mark_reviewed, 1 still active in Mongrel.
Thu Nov 02 13:08:26 PST 2006: 0 threads sync_waiting for /admin/mark_reviewed, 3 still active in Mongrel.
Thu Nov 02 13:08:37 PST 2006: 0 threads sync_waiting for /admin/mark_reviewed, 3 still active in Mongrel.
Thu Nov 02 13:09:08 PST 2006: 1 threads sync_waiting for /admin/mark_reviewed, 4 still active in Mongrel.
Thu Nov 02 13:09:35 PST 2006: Error calling Dispatcher.dispatch #<Sync_m::Err::UnknownLocker: Thread(#<Thread:0xb7234be4 aborting>) not locked.>
/usr/lib/ruby/1.8/sync.rb:57:in `Fail'
/usr/lib/ruby/1.8/sync.rb:63:in `Fail'

and

Thu Nov 02 00:05:29 PST 2006: 0 threads sync_waiting for /berkeley/downzoning/comments, 1 still active in Mongrel.
Thu Nov 02 00:05:37 PST 2006: 0 threads sync_waiting for /berkeley/downzoning/comments, 1 still active in Mongrel.
Thu Nov 02 00:06:11 PST 2006: 0 threads sync_waiting for /berkeley/downzoning/comments, 1 still active in Mongrel.
Thu Nov 02 00:07:07 PST 2006: 0 threads sync_waiting for /berkeley/downzoning/comments, 1 still active in Mongrel.
Thu Nov 02 00:07:27 PST 2006: 0 threads sync_waiting for /email_updates, 1 still active in Mongrel.
Thu Nov 02 00:07:27 PST 2006: 0 threads sync_waiting for /email_updates_edit, 1 still active in Mongrel.
Thu Nov 02 00:07:53 PST 2006: 0 threads sync_waiting for /berkeley/bus_rapid_transit/page/brtqanda, 1 still active in Mongrel.
Thu Nov 02 00:08:11 PST 2006: 0 threads sync_waiting for /email_updates_edit, 2 still active in Mongrel.
Thu Nov 02 00:08:39 PST 2006: 0 threads sync_waiting for /robots.txt, 1 still active in Mongrel.
Thu Nov 02 00:08:39 PST 2006: 1 threads sync_waiting for /email_updates_edit, 3 still active in Mongrel.
Thu Nov 02 00:09:50 PST 2006: 0 threads sync_waiting for /howitworks.php, 1 still active in Mongrel.
Thu Nov 02 00:09:50 PST 2006: 3 threads sync_waiting for /email_updates_edit, 5 still active in Mongrel.

So - as you can tell, I am a newbie at wits' end, hoping you guys can
1) help me fix the problem, and 2) help me implement a temporary workaround
so I can stop checking every few hours to see if I need to cap -a restart_app
(which, so far, has always worked...).

Thanks for your careful attention.

Cheers

Robert Vogel

> -------- Original Message --------
> Subject: Re: [Mongrel] Problems with mongrel dying
> From: "Zed A. Shaw" <zedshaw at zedshaw.com>
> Date: Tue, October 31, 2006 2:36 pm
> To: mongrel-users at rubyforge.org
>
> On Tue, 31 Oct 2006 12:48:02 -0700
> Robert Vogel <robert at kitchendemocracy.org> wrote:
>
> > Hi
> >
> > One of the two mongrel processes has died in the middle of the night
> > four times in the past 9 days, and I need help debugging this.
> >
> > Each time the symptoms are the same:
>
> Really quick, but upgrade to the pre-release and then tell me if you still get these:
>
>   sudo gem install mongrel --source=http://mongrel.rubyforge.org/releases
>
> If it does not fix the problem (remember, it's random, so let it run in production for a while), then turn on USR1 logging and watch for the rails action that is blocking things:
>
>   sudo killall -USR1 mongrel_rails
>
> Otherwise, keep in mind that many, many people use Mongrel without blocking problems, so you need to rule out anything non-standard you're using that can cause problems.  RMagick, frequent DNS calls, and working with files or shared resources are all main culprits.
>
> --
> Zed A. Shaw, MUDCRAP-CE Master Black Belt Sifu
> http://www.zedshaw.com/
> http://safari.oreilly.com/0321483502 -- The Mongrel Book
> http://mongrel.rubyforge.org/
> http://www.lingr.com/room/3yXhqKbfPy8 -- Come get help.
> _______________________________________________
> Mongrel-users mailing list
> Mongrel-users at rubyforge.org
> http://rubyforge.org/mailman/listinfo/mongrel-users
Hi Robert,

I have no idea about your problem, but the workaround is simple.  Use Monit
to regulate your mongrels (it'll do the checking and restarting for you
every 3 minutes).

I risk falling on Zed's sword of wrath, but my mongrels escalate in memory
usage under load and I use Monit to restart them and keep 'em tidy.
(I fully presume it's because of MY lousy code.)

Sam
On Fri, 3 Nov 2006 15:25:32 +1300
"Sam Giffney" <samuelgiffney at gmail.com> wrote:

> Hi Robert,
> I have no idea about your problem but the workaround is simple. Use
> Monit to regulate your mongrels (it'll do the checking and restarting
> for you every 3 minutes)
> I risk falling on Zed's sword of wrath but my mongrels escalate in
> memory usage under load and I use Monit to restart them and keep em
> tidy
> (I fully presume it's because of MY lousy code)

That's probably the best you can do until you can find out why it's leaking.
My past experience has been to never trust Ruby's GC or any external C
extensions you may be using.  I've combed through Mongrel to insane levels
to root out all possible leaks I can.  Yet, the Sync vs. Mutex bug and the
recent Array patch from Eric M. show that Ruby's GC has a long way to go.

If you can't find it right away with the -B option and
log/mongrel_debug/objects.log then it's probably too deep inside Ruby or a
C extension you're using (RMagick is a big one).

Just hit it with monit and at least your site keeps going.  Ugly, but it
works.

--
Zed A. Shaw, MUDCRAP-CE Master Black Belt Sifu
http://www.zedshaw.com/
http://safari.oreilly.com/0321483502 -- The Mongrel Book
http://mongrel.rubyforge.org/
http://www.lingr.com/room/3yXhqKbfPy8 -- Come get help.
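For reference, a minimal monit stanza for one mongrel along the lines Sam
and Zed describe might look roughly like the following.  This is only a
sketch: the pidfile location, application path, port, and memory threshold
are illustrative assumptions, not values from this thread, and would need
to match your own deployment.

  check process mongrel_8000 with pidfile /var/www/apps/myapp/current/log/mongrel.8000.pid
    start program = "/usr/bin/mongrel_rails start -d -e production -p 8000 -P log/mongrel.8000.pid -c /var/www/apps/myapp/current"
    stop program  = "/usr/bin/mongrel_rails stop -P log/mongrel.8000.pid -c /var/www/apps/myapp/current"
    if totalmem > 100.0 MB for 4 cycles then restart
    if failed port 8000 protocol http then restart

A second, near-identical stanza would cover the mongrel on port 8001;
monit's poll interval and alert e-mail address go in its global
configuration.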
On Thu, 02 Nov 2006 19:13:54 -0700
Robert Vogel <robert at kitchendemocracy.org> wrote:

> Hello -
>
> I need help.  For the past 11 days, one of the two mongrel processes on
> my railsmachine VPS has been crashing intermittently - it has crashed
> about 10 times, with increasing frequency in the past few days.
> Unfortunately, after many, many hours I still have not been able to
> reproduce this problem in a controlled way - neither on my production
> railsmachine nor on my development machine.  As far as I can tell, I
> have followed these suggestions from Bradley and Zed and Zed's Mongrel
> book:

That's awesomely complete for an analysis.  And you say they just "poof" go
away?

Ok, so there *has* to be something different about your setup compared to
other folks.  I'm thinking the oom killer thing is to blame.  Usually when
someone says it "just goes boom" it's because of this.  While you may think
your mongrel process isn't using that much ram, keep in mind it's based on
how much is available vs. what's used.  You could have a mysql that eats
all the ram and Linux decides to kill your mongrel.

So, to confirm this isn't the case, monitor your memory usage totals during
the day, and install monit to e-mail you when it needs to restart.  Monit
will tell you it had to restart, then you can jump on and see what your
memory usage log was reporting.  I usually just run a little cron job that
spits out the free command to a file once every 5 minutes or so.

Also, look in the linux log files.  IIRC the oom killer doesn't say
anything, but check to see if they improved this "magic".

Zed
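One way to keep the kind of memory log Zed describes is a crontab entry
along these lines; the log path is only an example, and anywhere writable
will do:

  # append a timestamp and free's memory totals every 5 minutes
  */5 * * * * (date; free -m) >> /var/log/memory-usage.log 2>&1

After monit reports a restart, you can line the restart time up against
this log, and also grep the kernel log (e.g. /var/log/messages or the
output of dmesg) for oom-killer entries.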
Hello Zed -

Thanks for the response; I will follow your advice, and will get right on
installing monit.

Question - you also say 'look in the linux log files'.  Which log files
should I look at, where are they, and what should I look for?

Thanks

Robert

> That's awesomely complete for an analysis.  And you say they just "poof"
> go away?
>
> Ok, so there *has* to be something different about your setup compared to
> other folks.  I'm thinking the oom killer thing is to blame.  Usually when
> someone says it "just goes boom" it's because of this.  While you may
> think your mongrel process isn't using that much ram, keep in mind it's
> based on how much is available vs. what's used.  You could have a mysql
> that eats all the ram and Linux decides to kill your mongrel.
>
> So, to confirm this isn't the case, monitor your memory usage totals
> during the day, and install monit to e-mail you when it needs to restart.
> Monit will tell you it had to restart, then you can jump on and see what
> your memory usage log was reporting.  I usually just run a little cron job
> that spits out the free command to a file once every 5 minutes or so.
>
> Also, look in the linux log files.  IIRC the oom killer doesn't say
> anything, but check to see if they improved this "magic".
On 11/2/06, Zed A. Shaw <zedshaw at zedshaw.com> wrote:

> That's probably the best you can do until you can find out why it's leaking.  My past
> experience has been to never trust Ruby's GC or any external C extensions you may be
> using.  I've combed through Mongrel to insane levels to root out all possible leaks I can.
> Yet, the Sync vs. Mutex bug and recent Array patch from Eric M. show that Ruby's GC has
> a long way to go.

I'll 1/2 second Zed's comments here.  I've spent a lot of time poking
around in Ruby internals looking at memory usage.  It is very easy for
someone to write a C extension that mismanages memory and causes Ruby
to leak, so always look suspiciously at an extension if you have a
leak that you can't find another cause for, unless you know with great
confidence that the extension is solid.

The Ruby GC itself is pretty simple and does what it is supposed to.
It will tend to have performance issues as the set of objects in RAM
increases, though there are strategies a person can sometimes use to
manage that, if needed.

The Sync vs Mutex thing, though, can not be laid at the foot of the
Ruby GC.  The problem is with array.c not releasing its data in a way
that allows the Ruby GC to handle it.  Refer back to the beginning of
this email about how easy it is to screw up memory management in C
extensions....

Now, that said, if you are using arrays and are using push and shift
operations to manage an array like a queue (or any libraries, like
Mutex, that you are using do this), that _will_ bite you in the ass
with memory usage, because of this Array bug.

Mutex has much better performance than Sync, though, especially if
there are more than a very small number of threads, so in this
specific case I continue to use a Mutex, but have patched around the
problematic Array usage by creating my own copy of the Mutex class
that uses Array in a way that doesn't suffer from the bug.


Kirk Haines
On Nov 3, 2006, at 8:50 AM, Kirk Haines wrote:

> On 11/2/06, Zed A. Shaw <zedshaw at zedshaw.com> wrote:
>
>> That's probably the best you can do until you can find out why
>> it's leaking.  My past
>> experience has been to never trust Ruby's GC or any external C
>> extensions you may be
>> using.  I've combed through Mongrel to insane levels to root out
>> all possible leaks I can.
>> Yet, the Sync vs. Mutex bug and recent Array patch from Eric M.
>> show that Ruby's GC has
>> a long way to go.
>
> I'll 1/2 second Zed's comments here.  I've spent a lot of time poking
> around in Ruby internals looking at memory usage.  It is very easy for
> someone to write a C extension that mismanages memory and causes Ruby
> to leak, so always look suspiciously at an extension if you have a
> leak that you can't find another cause for, unless you know with great
> confidence that the extension is solid.
>
> The Ruby GC itself is pretty simple and does what it is supposed to.
> It will tend to have performance issues as the set of objects in RAM
> increases, though there are strategies a person can sometimes use to
> manage that, if needed.
>
> The Sync vs Mutex thing, though, can not be laid at the foot of the
> Ruby GC.  The problem is with array.c not releasing its data in a way
> that allows the Ruby GC to handle it.  Refer back to the beginning of
> this email about how easy it is to screw up memory management in C
> extensions....
>
> Now, that said, if you are using arrays and are using push and shift
> operations to manage an array like a queue (or any libraries, like
> Mutex, that you are using do this), that _will_ bite you in the ass
> with memory usage, because of this Array bug.
>
> Mutex has much better performance than Sync, though, especially if
> there are more than a very small number of threads, so in this
> specific case I continue to use a Mutex, but have patched around the
> problematic Array usage by creating my own copy of the Mutex class
> that uses Array in a way that doesn't suffer from the bug.
>
>
> Kirk Haines

Hey Kirk-

Would you mind sharing your patched mutex.rb file?  I would appreciate it.

Thanks-

-- Ezra Zygmuntowicz
-- Lead Rails Evangelist
-- ez at engineyard.com
-- Engine Yard, Serious Rails Hosting
-- (866) 518-YARD (9273)
On 11/3/06, Ezra Zygmuntowicz <ezmobius at gmail.com> wrote:

> On Nov 3, 2006, at 8:50 AM, Kirk Haines wrote:
>
> > On 11/2/06, Zed A. Shaw <zedshaw at zedshaw.com> wrote:
> >
> >> That's probably the best you can do until you can find out why
> >> it's leaking.  My past
> >> experience has been to never trust Ruby's GC or any external C
> >> extensions you may be
> >> using.  I've combed through Mongrel to insane levels to root out
> >> all possible leaks I can.
> >> Yet, the Sync vs. Mutex bug and recent Array patch from Eric M.
> >> show that Ruby's GC has
> >> a long way to go.
> >
> > I'll 1/2 second Zed's comments here.  I've spent a lot of time poking
> > around in Ruby internals looking at memory usage.  It is very easy for
> > someone to write a C extension that mismanages memory and causes Ruby
> > to leak, so always look suspiciously at an extension if you have a
> > leak that you can't find another cause for, unless you know with great
> > confidence that the extension is solid.
> >
> > The Ruby GC itself is pretty simple and does what it is supposed to.
> > It will tend to have performance issues as the set of objects in RAM
> > increases, though there are strategies a person can sometimes use to
> > manage that, if needed.
> >
> > The Sync vs Mutex thing, though, can not be laid at the foot of the
> > Ruby GC.  The problem is with array.c not releasing its data in a way
> > that allows the Ruby GC to handle it.  Refer back to the beginning of
> > this email about how easy it is to screw up memory management in C
> > extensions....
> >
> > Now, that said, if you are using arrays and are using push and shift
> > operations to manage an array like a queue (or any libraries, like
> > Mutex, that you are using do this), that _will_ bite you in the ass
> > with memory usage, because of this Array bug.
> >
> > Mutex has much better performance than Sync, though, especially if
> > there are more than a very small number of threads, so in this
> > specific case I continue to use a Mutex, but have patched around the
> > problematic Array usage by creating my own copy of the Mutex class
> > that uses Array in a way that doesn't suffer from the bug.
> >
> >
> > Kirk Haines
>
> Hey Kirk-
>
> Would you mind sharing your patched mutex.rb file?  I would appreciate it.

IIRC, the patch consists of replacing shift with pop, and push with
unshift respectively, thus reversing the queue direction.  The idea is to
avoid Array#shift, as it retains the reference to the removed item.
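Kirk's actual patched mutex.rb never made it into this thread, but the
pattern described above is easy to sketch.  The class and method names
below are made up for illustration; the point is only that unshift/pop
preserves first-in, first-out order while avoiding Array#shift, which (per
the discussion above) retained a reference to the removed element in Ruby
1.8's array.c:

  # Sketch only - not Kirk's real patch.
  class WaitQueue
    def initialize
      @items = []
    end

    # Enqueue at the front of the array instead of pushing onto the back...
    def enqueue(obj)
      @items.unshift(obj)
    end

    # ...and dequeue from the back instead of shifting off the front.
    # Same FIFO behaviour, but Array#shift is never called, so the removed
    # element is not kept alive by the array's internals.
    def dequeue
      @items.pop
    end

    def empty?
      @items.empty?
    end
  end

Inside a 1.8-era Mutex, the waiting-threads list would be handled the same
way: @waiting.unshift Thread.current when a thread blocks, and
@waiting.pop when choosing the next thread to wake.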