Ian Lesperance
2008-Jul-03 20:33 UTC
[Backgroundrb-devel] PID File Overwritten on Failed Start
I have monitoring in place for BackgrounDRb to ensure that it stays up, but I've been getting some false alarms lately. I've realized that it has to do with the way BackgrounDRb daemonizes. If you attempt to start BackgrounDRb while it's already running, it will (1) write its new PID to the file and then (2) fail with Errno::EADDRINUSE when it tries to bind its socket.

Because of a small deployment race condition, my monitoring software sometimes attempts to start BackgrounDRb at the same time as my actual deployment scripts. This causes an invalid PID to be written to the file. Since my monitoring software uses this PID file to determine the status of BackgrounDRb, it keeps sending out false outage alerts and attempting (and failing) to restart BackgrounDRb.

Now, one quick and simple solution is to store the old PID in a variable, rescue the Errno::EADDRINUSE, and restore the old PID. I've already written a patch that does just that.

However, is there any reason it should even attempt to start in the presence of an existing process? If not, then I could just use something like Process.getpgid() to check whether the old process still exists and abort the start before the PID file is touched.

Thoughts?

Ian
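For illustration, the process-existence check suggested above might look roughly like the following. This is only a sketch: the helper names (`read_pid`, `process_alive?`, `claim_pid_file`) are hypothetical and are not part of BackgrounDRb, and it assumes the PID file holds a single integer.

```ruby
# Hypothetical helpers; the names are illustrative, not BackgrounDRb's API.

# Returns the PID stored in the file, or nil if the file is missing,
# empty, or does not contain a positive integer.
def read_pid(pid_file)
  return nil unless File.exist?(pid_file)
  pid = File.read(pid_file).strip.to_i
  pid > 0 ? pid : nil
end

# True if a process with this PID currently exists.
def process_alive?(pid)
  return false if pid.nil?
  Process.getpgid(pid) # raises Errno::ESRCH when no such process exists
  true
rescue Errno::ESRCH
  false
end

# Writes the new PID only when no live process owns the PID file.
# Returns true if the file was written, false if the start was aborted.
def claim_pid_file(pid_file, new_pid = Process.pid)
  return false if process_alive?(read_pid(pid_file))
  File.write(pid_file, new_pid.to_s)
  true
end
```

A start script could call `claim_pid_file` first and exit early when it returns false, so a failed start never touches the existing PID file. Note that a PID can be reused by an unrelated process, so a check like this is a heuristic, not a guarantee.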
Ian Lesperance
2008-Jul-03 22:40 UTC
[Backgroundrb-devel] PID File Overwritten on Failed Start
Actually, I realized there's still a race condition here, albeit a much smaller one than before. If two fresh starts are attempted at the same time, neither would see a PID in the file. The problem would therefore still exist if the start that fails was also the one that wrote to the file last, since it would have no old PID to revert to.

Since the window for that to happen is extremely narrow, I would argue that this particular race is not worth fretting over. But if any of you disagree on the severity, I'd be happy to discuss it further. As it stands, though, I think checking process existence and/or catching EADDRINUSE will be enough for my needs.

Ian
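As a rough sketch of the "catch EADDRINUSE and restore the old PID" idea discussed in this thread (not the actual patch, which is not shown here), something along these lines could work. The method name `start_with_pid_restore` is hypothetical, and a plain `TCPServer` stands in for whatever socket setup BackgrounDRb really performs.

```ruby
require 'socket'

# Hypothetical wrapper: writes the new PID, attempts to bind, and puts the
# previous PID file contents back if the address is already in use.
# Returns the server socket on success, nil if the start failed.
def start_with_pid_restore(pid_file, host, port)
  # Remember whatever was in the PID file before this start attempt.
  old_contents = File.exist?(pid_file) ? File.read(pid_file) : nil
  File.write(pid_file, Process.pid.to_s)
  TCPServer.new(host, port) # raises Errno::EADDRINUSE if the port is taken
rescue Errno::EADDRINUSE
  if old_contents
    File.write(pid_file, old_contents) # restore the running server's PID
  elsif File.exist?(pid_file)
    File.delete(pid_file)              # there was no earlier PID to restore
  end
  nil
end
```

As noted in the message above, this does not close the narrow window in which two fresh starts both find no PID to restore; it only covers the case where a running server already owns the file.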