Hi,

I have an application which is dying horrible deaths (i.e. segmentation
faults) in mid-flight, in production... And of course, I should fix it.
But while I find and fix the bugs, I found something I think should be
different - I can work on submitting a patch, as it is quite simple, but
I might be missing something in my rationale.

When Mongrel segfaults, it does not - obviously - get to clean up after
itself, so it does not remove the PID files. As an example:

$ sudo /etc/init.d/mongrel-cluster start
Starting mongrel-cluster: Starting all mongrel_clusters...
mongrel-cluster.
$ sudo cat tmp/pids/mongrel.8203.pid | xargs kill -9
$ sudo /etc/init.d/mongrel-cluster status
(...)
found pid_file: tmp/pids/mongrel.8203.pid
missing mongrel_rails: port 8203
(...)
$ sudo /etc/init.d/mongrel-cluster restart
Restarting mongrel-cluster: Restarting all mongrel_clusters...
** !!! PID file tmp/pids/mongrel.8203.pid already exists. Mongrel could be
running already. Check your log/mongrel.8203.log for errors.
** !!! Exiting with error. You must stop mongrel and clear the .pid before
I'll attempt a start.
mongrel-cluster.

So, what's the solution? I must manually do:

$ sudo rm tmp/pids/mongrel.8203.pid
$ sudo /etc/init.d/mongrel-cluster restart

And now it works.

What should happen? Well, 'status' already found that there is a stale
PID. Of course, the 'status' action means exactly that: get the status,
do nothing else. But the 'stop' action should clean up PID files whose
processes no longer exist, and the 'start' action should check whether
the process with that PID is alive, and ignore the PID file if it's not.
At the very least, this behaviour should be specifiable via the
configuration file.

What do you think?

--
Gunnar Wolf - gwolf at iiec.unam.mx - (+52-55)5623-0154 / 1451-2244
PGP key 1024D/8BB527AF 2001-10-23
Fingerprint: 0C79 D2D1 2C4E 9CE4 5973 F800 D80E F35A 8BB5 27AF
use the mongrel_cluster --clean option

On 6/5/08, Gunnar Wolf <gwolf at gwolf.org> wrote:
> [...]
On Thu, 5 Jun 2008 16:08:06 -0500
Gunnar Wolf <gwolf at gwolf.org> wrote:

> What should happen? Well, 'status' already found that there is a stale
> PID. [...]

That would be the ideal situation, but Ruby doesn't have good enough
process management APIs to do this portably. To make it work, you'd have
to be able to portably take a PID and see if there's a mongrel running
with that PID.

You can't use /proc or /sys because those are Linux-only. You can't use
`ps` because the OSX morons changed everything, Solaris has a different
format, etc.

If you were to do this, you'd have to dip into C code to pull it off.

Now, if you're only on Linux then you could write yourself a small
little hack to the mongrel_rails script that did this with info out
of /proc.

--
Zed A. Shaw
- Hate: http://savingtheinternetwithhate.com/
- Good: http://www.zedshaw.com/
- Evil: http://yearofevil.com/
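A minimal sketch of the Linux-only /proc hack Zed describes - untested,
and the helper name plus matching on "mongrel_rails" in the command line
are assumptions, not anything mongrel ships:

------------------------------------------------------------
#!/usr/bin/ruby
# Linux-only sketch: decide whether the process named in a PID file is
# still a mongrel by reading its command line from /proc.
def stale_pid_file?(pid_file)
  pid = File.read(pid_file).to_i
  cmdline = "/proc/#{pid}/cmdline"
  return true unless File.exist?(cmdline)  # no such process: stale
  # /proc/<pid>/cmdline separates arguments with NUL bytes
  args = File.read(cmdline).split("\0")
  # alive, but the PID may have been recycled by an unrelated process
  !args.any? { |a| a.include?("mongrel_rails") }
end

puts stale_pid_file?("tmp/pids/mongrel.8203.pid") ? "stale" : "running"
------------------------------------------------------------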
At Thu, 5 Jun 2008 16:08:06 -0500,
Gunnar Wolf <gwolf at gwolf.org> wrote:

> I have an application which is dying horrible deaths
> (i.e. segmentation faults) in mid-flight, in production...
> [...]

I use the following bit in my Capistrano scripts before I start
Mongrel:

( [ -f pid_file ] && ( kill -0 `cat pid_file` >& /dev/null || rm pid_file ) )

which handles the typical cases (in which no process with a given pid
is running, or a process is running with a different owner from the
mongrel owner) but not the edge case where a process is running, with
the same owner, but is no longer a mongrel process. You could
supplement this with Linux/Solaris-specific stuff to check if the
process running is actually a mongrel.

best,
Erik Hetzner

;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3
> kill -0 `cat pid_file` >& /dev/null

more like

kill -0 $(<pid_file) >& /dev/null

regards,
Istvan

Erik Hetzner wrote:
> [...]
Zed A. Shaw dijo [Fri, Jun 06, 2008 at 01:01:32AM -0400]:
> That would be the ideal situation, but Ruby doesn't have good enough
> process management APIs to do this portably. To make it work, you'd have
> to be able to portably take a PID and see if there's a mongrel running
> with that PID.
> [...]

Oh, silly me... I thought Ruby's Process class dealt with the
architectural incompatibilities... What I wrote to check for the status
is quite straightforward:

------------------------------------------------------------
#!/usr/bin/ruby
require 'yaml'

confdir     = '/etc/mongrel-cluster/sites-enabled'
restart_cmd = '/etc/init.d/mongrel-cluster restart'
needs_restart = false

(Dir.open(confdir).entries - ['.', '..']).each do |site|
  conf = YAML.load_file "#{confdir}/#{site}"
  pid_location = [conf['cwd'], conf['pid_file']].join('/').gsub(/\.pid$/, '*.pid')
  pid_files = Dir.glob(pid_location)
  pid_files.each do |pidf|
    pid = File.read(pidf)
    begin
      Process.getpgid(pid.to_i)
    rescue Errno::ESRCH
      warn "Process #{pid} (cluster #{site}) is dead!"
      File.unlink pidf
      needs_restart = true
    end
  end
end

system(restart_cmd) if needs_restart
------------------------------------------------------------

(periodically run via cron)

I guess this works in any Unixy environment... I have no idea whether
Windows implements something similar to Process.getpgid, or for that
matter, anything about Windows' process management.

Greetings,

--
Gunnar Wolf - gwolf at gwolf.org - (+52-55)5623-0154 / 1451-2244
PGP key 1024D/8BB527AF 2001-10-23
Fingerprint: 0C79 D2D1 2C4E 9CE4 5973 F800 D80E F35A 8BB5 27AF
Tikhon Bernstam dijo [Thu, Jun 05, 2008 at 07:29:22PM -0700]:
> use the mongrel_cluster --clean option

Very good addition to the overall logic, keeps things cleaner :-)

--
Gunnar Wolf - gwolf at gwolf.org - (+52-55)5623-0154 / 1451-2244
PGP key 1024D/8BB527AF 2001-10-23
Fingerprint: 0C79 D2D1 2C4E 9CE4 5973 F800 D80E F35A 8BB5 27AF
Gunnar Wolf <gwolf at gwolf.org> wrote:
> Zed A. Shaw dijo [Fri, Jun 06, 2008 at 01:01:32AM -0400]:
> > [...]
>
> I guess this works in any Unixy environment... I have no idea whether
> Windows implements something similar to Process.getpgid, or for that
> matter, anything about Windows' process management.

Process.kill(0, pid) also works and is (in my experience) more widely
used.

--
Eric Wong
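A minimal sketch of that check - the helper name is made up, and treating
EPERM as "alive, but owned by another user" is an assumption about what
the caller wants:

------------------------------------------------------------
# Sketch: is there any process with this PID? Signal 0 delivers
# nothing; the kernel only performs the existence/permission check.
def process_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH   # no such process
  false
rescue Errno::EPERM   # process exists, but belongs to another user
  true
end

pid_file = 'tmp/pids/mongrel.8203.pid'  # placeholder path
pid = File.read(pid_file).to_i
File.unlink(pid_file) unless process_alive?(pid)
------------------------------------------------------------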
Zed A. Shaw wrote:
> That would be the ideal situation, but Ruby doesn't have good enough
> process management APIs to do this portably.

Erik Hetzner:
> ... but not the edge case where a process is running, with
> the same owner, but is no longer a mongrel process.

I feel obligated to reply. :)

PID files suck. I think it's really stupid that modern operating systems
don't provide some kind of mechanism to automatically delete a file when
a process exits (even when it exits abnormally).

Anyway, I've written a fair share of daemons in the past. What I tend to
do is to combine PID files with a number of lock files:

- foo.pid. This is obviously the PID file.
- foo.lock. This is a lock file whose lock is held for the lifetime of
  the daemon. If the daemon exits, whether normally or abnormally, the
  lock on that file is released.

To check whether foo.pid is stale, we simply check whether foo.lock is
locked. The only way to check whether foo.lock is locked is to try to
lock it with the non-blocking flag. If locking fails then it's already
locked, meaning that the PID file is not stale.

However, this could result in a race condition. Suppose that you are
starting a daemon while simultaneously checking whether the daemon is
already started:

1. The checker acquires a non-blocking lock on foo.lock. This succeeds,
   so it knows that the PID file is stale. It prints "stale PID file
   detected" on screen, and is about to release the lock on foo.lock.
2. All of a sudden, before the lock is released, a context switch
   occurs. The daemon that is being started tries to acquire a lock on
   foo.lock. This fails because the checker still holds the lock, so the
   daemon thinks that there's already a daemon running, and exits.

So we need another lock file to serialize all PID-file-related actions:

- foo.global.lock

So the code for checking whether the daemon's running is something like
this:

def check():
    lock(foo.global.lock)
    if try_lock(foo.lock):
        # Locking succeeded, so we have a stale PID file here.
        unlock(foo.lock)
        unlock(foo.global.lock)
        return nil
    else:
        # Locking failed. Process is still running.
        # Of course, your code should also check whether the PID file
        # actually exists.
        pid = read_pid_file(foo.pid)
        unlock(foo.global.lock)
        return pid

Daemon code:

lock(foo.global.lock)
write_pid_file(foo.pid)
lock(foo.lock)
unlock(foo.global.lock)

main_loop()

lock(foo.global.lock)
delete_file(foo.pid)
unlock(foo.lock)
unlock(foo.global.lock)

NOTE: lock() creates the lock file if it doesn't already exist.

This works great, even on Windows. The only gotchas are:
- flock() doesn't work over NFS. You'll have to use some kind of fcntl()
  call to lock files over NFS, but I'm not sure whether Ruby provides an
  API for that.
- foo.global.lock is never deleted. You cannot safely delete it without
  creating some kind of race condition.
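A Ruby rendering of the check() half of that scheme, using File#flock -
an untested sketch, with placeholder file names:

------------------------------------------------------------
# Sketch of check() using flock(2). Returns the PID if the daemon is
# alive, nil if the PID file is stale. Closing a file releases its
# flock, so the block form of File.open doubles as unlock().
GLOBAL_LOCK   = 'foo.global.lock'
LIFETIME_LOCK = 'foo.lock'
PID_FILE      = 'foo.pid'

def check
  File.open(GLOBAL_LOCK, 'w') do |global|
    global.flock(File::LOCK_EX)  # serialize all PID-file actions
    File.open(LIFETIME_LOCK, 'w') do |lt|
      if lt.flock(File::LOCK_EX | File::LOCK_NB)
        # We got the lifetime lock: the daemon is gone, PID file stale.
        nil
      else
        # Lock held by the daemon: it is still running.
        File.exist?(PID_FILE) ? File.read(PID_FILE).to_i : nil
      end
    end
  end
end
------------------------------------------------------------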
Hongli Lai wrote:
> This works great, even on Windows. The only gotchas are:
> - flock() doesn't work over NFS. You'll have to use some kind of fcntl()
>   call to lock files over NFS, but I'm not sure whether Ruby provides an
>   API for that.
> - foo.global.lock is never deleted. You cannot safely delete it without
>   creating some kind of race condition.

I forgot to mention that it is safe to delete foo.lock. So the shutdown
part of the daemon code should look like this:

lock(foo.global.lock)
delete_file(foo.pid)
unlock(foo.lock)
delete_file(foo.lock)  # added this line
unlock(foo.global.lock)
On Wed, Jun 11, 2008 at 01:25:41AM +0200, Hongli Lai wrote:
> PID files suck.

Agreed. Just use daemontools or runit or some other process manager - no
pidfiles or complicated locking code needed.

--
Jos Backus
jos at catnook.com
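The setup under such a process manager is small; a sketch of what a
daemontools service directory's run script could look like for one
mongrel (the path, port, and environment are assumptions - the one real
requirement is that the mongrel stays in the foreground, i.e. no -d
flag, so supervise can watch it):

------------------------------------------------------------
#!/usr/bin/ruby
# Sketch of /service/myapp-8203/run (hypothetical service directory).
# supervise runs this script and restarts it whenever it dies; since it
# tracks its child directly, no PID file is involved.
Dir.chdir '/var/www/myapp'  # hypothetical application root
exec 'mongrel_rails', 'start', '-e', 'production', '-p', '8203'
------------------------------------------------------------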
Has anyone considered turning mongrel_cluster into a process manager
daemon? I know that generally many people rely on other applications
(such as monit) to ensure that mongrels are up and running, but it seems
that integrated process management out of the box would be a large win.

The mongrel_cluster could remain running (rather than exiting) and keep
track of the running mongrels (potentially restarting them if they die
or zombie). At that point, pid files become unneeded for tracking
running mongrels. The only exception would be if the mongrel cluster
itself dies - at that point it would orphan the child processes and it
would be up to the cluster to kill off (or resume ownership of) any
orphaned processes.

thoughts?

- scott

On Tue, Jun 10, 2008 at 4:50 PM, Jos Backus <jos at catnook.com> wrote:
> [...]
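The core of such a manager would be a fork/wait loop; a toy sketch (the
ports are placeholders, and real code would add restart rate-limiting
and signal handling):

------------------------------------------------------------
# Toy supervisor sketch: keep one mongrel per port alive, no PID files.
# The parent learns each child's PID from fork, and wait2 reports which
# child died, so there is never a stale PID to clean up.
PORTS = [8203, 8204]  # placeholder ports

def spawn_mongrel(port)
  fork do
    exec 'mongrel_rails', 'start', '-e', 'production', '-p', port.to_s
  end
end

children = {}  # pid => port
PORTS.each { |port| children[spawn_mongrel(port)] = port }

loop do
  pid, status = Process.wait2           # blocks until some child exits
  port = children.delete(pid) or next
  warn "mongrel on port #{port} died (#{status.inspect}); restarting"
  children[spawn_mongrel(port)] = port  # real code would rate-limit this
end
------------------------------------------------------------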
On Tue, Jun 10, 2008 at 06:24:58PM -0700, Scott Windsor wrote:
> Has anyone considered turning mongrel_cluster into a process manager
> daemon?

I'm not using this myself (I use standalone daemontools) but
mongrel_runit should fit the bill at least somewhat:

https://wiki.hjksolutions.com/display/MR/Home

--
Jos Backus
jos at catnook.com
On Tue, 10 Jun 2008 16:50:39 -0700
Jos Backus <jos at catnook.com> wrote:

> Agreed. Just use daemontools or runit or some other process manager - no
> pidfiles or complicated locking code needed.

You ever read the code to runit? I wouldn't touch that thing with a
10' pole. Haven't used daemontools though.

--
Zed A. Shaw
- Hate: http://savingtheinternetwithhate.com/
- Good: http://www.zedshaw.com/
- Evil: http://yearofevil.com/
On Wed, Jun 11, 2008 at 04:23:10PM -0400, Zed A. Shaw wrote:
> You ever read the code to runit? I wouldn't touch that thing with a
> 10' pole. Haven't used daemontools though.

Haven't looked at the runit code, no. Daemontools so far has worked
great for me for over a decade.

--
Jos Backus
jos at catnook.com