thr3ads.net - nsd users - [nsd-users] Edge case on nsdc? [Jul 2008]

Shane Kerr

2008-Jul-08 12:16 UTC

[nsd-users] Edge case on nsdc?

Hello,

We have this in our NSD logs occasionally:

[1214740996] nsd[93921]: warning: nsd is already running as 93888, continuing
[1214740996] nsd[93922]: error: can't bind the socket: Address already in use
[1214741027] nsd[94418]: error: can't bind the socket: Address already in use
[1214741057] nsd[94932]: error: can't bind the socket: Address already in use


I think this is because we have a script that monitors NSD to make sure
it is running at all times and attempts to start it... even though NSD is
already running.


In the nsdc.sh script we see the following:


signal() {
         if [ -s ${pidfile} ]
         then
                 kill -"$1" `cat ${pidfile}` && return 0
         else
                 echo "nsd is not running"
         fi
         return 1
}


But it seems like NSD restarts itself regularly, getting a new process  
ID when it does so. In this case, we have the possibility for the  
following to happen:

- nsdc.sh reads the contents of pidfile

- NSD restarts, getting a new PID

- nsdc.sh sends a signal to test NSD using the old PID, which fails,  
so nsdc claims NSD is not running

Is this possible?



It is possible to work around this with a little more sophistication,  
I think:

signal() {
        while true
        do
                # if there is no PID file, NSD is not running
                if [ ! -s "${pidfile}" ]
                then
                        return 1
                fi

                # if we can send the signal to the PID, then NSD is running
                #   (or some other process with that PID, but we hope not...)
                PID=`cat "${pidfile}"`
                if kill -"$1" "$PID"
                then
                        return 0
                fi

                # double-check NSD did not restart between the time we read
                # the PID and the time we sent the signal; compare as strings
                # so an empty or vanished pidfile does not break the test
                CHECK_PID=`cat "${pidfile}"`
                if [ "$PID" = "$CHECK_PID" ]
                then
                        echo "nsd is not running"
                        return 1
                fi
        done
}

--
Shane

Shane Kerr

2008-Jul-09 22:51 UTC

[nsd-users] Edge case on nsdc?

Matthijs,

On Jul 9, 2008, at 12:22 +0200, Matthijs Mekking wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi Shane,
>
>> [1214740996] nsd[93921]: warning: nsd is already running as 93888, continuing
>> [1214740996] nsd[93922]: error: can't bind the socket: Address already in use
>> [1214741027] nsd[94418]: error: can't bind the socket: Address already in use
>> [1214741057] nsd[94932]: error: can't bind the socket: Address already in use
>
> This occurs when you call nsd manually (eg without nsdc, NSD control
> script). Because NSD is already running, it can't bind the socket, and
> server initialization for this process fails. Because server
> initialization fails, it tries to remove the pidfile. Hence, later you
> will only see the socket bind error, and no longer the 'already running'
> warning. (and therefore, nsdc running will tell you it is not running)
>
> I changed in nsd.c that the pidfile is written only after succeeding
> server initialization.
Cool.
>> I think this is because we have a script monitoring to make sure NSD is
>> running at all time and attempts to start it... even though NSD is
>> already running.
>
> What script do you use for monitoring NSD? nsdc also can be used for
> this. nsdc running to check if nsd is running, if it returns 1 (not
> running), you can do nsdc start.
We use nsdc for this. The script basically does:

while true; do
     if ! nsdc running; then
         nsdc start
     fi
     sleep 15
done
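
Since a single failed "nsdc running" can be a false negative, one way to make this loop less trigger-happy is to re-check a few times before starting anything. This is only a sketch: ensure_running, check_cmd and start_cmd are names made up for illustration (they are not part of nsdc), and the retry count and one-second pause are arbitrary.

```shell
# check_cmd and start_cmd are hypothetical placeholders; in the setup
# described above they would be "nsdc running" and "nsdc start".
check_cmd=${check_cmd:-"nsdc running"}
start_cmd=${start_cmd:-"nsdc start"}

# Only start the server if several consecutive checks say it is down.
ensure_running() {
    tries=3
    while [ "$tries" -gt 0 ]
    do
        # a single success is enough to conclude the server is up
        if $check_cmd >/dev/null 2>&1
        then
            return 0
        fi
        tries=$((tries - 1))
        # pause between checks so a reload in progress can finish
        if [ "$tries" -gt 0 ]
        then
            sleep 1
        fi
    done
    $start_cmd
}
```

The monitoring loop would then call ensure_running in place of the if/nsdc start pair. This narrows the window but does not close it; the check itself still has to cope with the PID changing underneath it.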
>> In the nsdc.sh script we see the following:
>>
>>
>> signal() {
>>        if [ -s ${pidfile} ]
>>        then
>>                kill -"$1" `cat ${pidfile}` && return 0
>>        else
>>                echo "nsd is not running"
>>        fi
>>        return 1
>> }
>>
>>
>> But it seems like NSD restarts itself regularly, getting a new process
>> ID when it does so. In this case, we have the possibility for the
>> following to happen:
>>
>> - nsdc.sh reads the contents of pidfile
>>
>> - NSD restarts, getting a new PID
>>
>> - nsdc.sh sends a signal to test NSD using the old PID, which fails, so
>> nsdc claims NSD is not running
>>
>> Is this possible?
>
> As far as I know, when NSD restarts (because it received a dedicated
> signal), it takes care of updating the pidfile.
When you use "nsdc patch", you get an implicit "nsdc reload". We run
this from a cron job.

nsdc reload issues a SIGHUP to NSD.

This eventually ends up in the server_main() function in server.c,  
which calls fork(), and therefore gets a new pid, which it then writes  
into the pidfile.

So, the scenario is:

Time 1: NSD, running as PID A, writes into pidfile
Time 2: nsdc reads PID A from pidfile
Time 3: NSD gets a SIGHUP, forks a new process with PID B, and exits  
the old process
Time 4: nsdc sends a signal to PID A, which no longer exists
Time 5: nsdc returns "server not running" even though the server is  
running.
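
The timeline can be reproduced without NSD at all. The sketch below uses "sleep" processes as stand-ins for the old and new server and a temporary file as the pidfile; the variable names are invented for the demonstration, and "kill -0" is used only to test whether a PID exists.

```shell
# Simulate the stale-PID race with stand-in processes (not real NSD).
pidfile=$(mktemp)

sleep 30 &                    # stand-in for NSD running as PID A
pid_a=$!
echo "$pid_a" > "$pidfile"    # Time 1: "server" writes PID A

read_pid=$(cat "$pidfile")    # Time 2: checker reads PID A

sleep 30 &                    # Time 3: "reload" forks PID B,
pid_b=$!
echo "$pid_b" > "$pidfile"    #         writes it to the pidfile,
kill "$pid_a"                 #         and the old process exits
wait "$pid_a" 2>/dev/null

# Time 4: checker signals the PID it read earlier, which is gone
if kill -0 "$read_pid" 2>/dev/null
then
    stale_result="running"
else
    stale_result="not running"   # Time 5: false negative
fi

# Re-reading the pidfile gives the new PID, which is alive
fresh_pid=$(cat "$pidfile")
if kill -0 "$fresh_pid" 2>/dev/null
then
    fresh_result="running"
else
    fresh_result="not running"
fi

echo "stale check: $stale_result; fresh check: $fresh_result"

# clean up the stand-in process and the temporary pidfile
kill "$pid_b"
wait "$pid_b" 2>/dev/null
rm -f "$pidfile"
```

The stale check reports "not running" while the fresh check reports "running", which is the discrepancy described in Time 5.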
>> It is possible to work around this with a little more sophistication, I
>> think:
>>
>> signal() {
>>    while true
>>    do
>>        # if there is no PID file, NSD is not running
>>        if [ ! -s ${pidfile} ]
>>        then
>>            return 1
>>        fi
>>
>>        # if we can send the signal to the PID, then NSD is running
>>        #   (or some other process with that PID, but we hope not...)
>>        PID=`cat ${pidfile}`
>>        if kill -"$1" $PID
>>        then
>>            return 0
>>        fi
>>
>>        # double-check NSD did not restart between the time we read the PID
>>        # and the time we sent the signal
>>        CHECK_PID=`cat ${pidfile}`
>>        if [ $PID -eq $CHECK_PID ]
>>        then
>>            echo "nsd is not running"
>>            return 1
>>        fi
>>    done
>> }
>
> Could you try the trunk release? I think it already fixes this issue.
> Make sure your control script first checks if nsd is running (nsdc
> running) and if not start it (nsdc start).
The fix you made makes sense, and should be included.

But I am reasonably sure there is nothing that the server can do to  
fix this problem (mind you I am a bit sleep-deprived right now, so no  
promises). ;)

I think the script needs to work like I coded it here, where it checks
that the PID of the server did not change while it was checking.
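
For anyone who wants to try the re-check approach outside of nsdc.sh, here is a small harness. It is a sketch under assumptions: pidfile points at a temporary file rather than the real NSD pidfile, the "server" is a sleep process, and the function is a compacted copy of the proposal above, not what ships in nsdc.sh.

```shell
# pidfile is a temporary file here; nsdc.sh normally derives it from
# the NSD configuration.
pidfile=$(mktemp)

signal() {
    while true
    do
        # no pidfile (or an empty one) means the server is not running
        [ -s "$pidfile" ] || return 1
        PID=$(cat "$pidfile")
        kill -"$1" "$PID" 2>/dev/null && return 0
        # the signal failed: only report "not running" if the pidfile
        # still names the same PID, otherwise loop and try the new one
        CHECK_PID=$(cat "$pidfile" 2>/dev/null)
        if [ "$PID" = "$CHECK_PID" ]
        then
            echo "nsd is not running"
            return 1
        fi
    done
}

# Case 1: pidfile names a live stand-in process (sleep, not NSD)
sleep 30 &
live=$!
echo "$live" > "$pidfile"
signal 0
live_rc=$?

# Case 2: pidfile names a process that has exited
kill "$live"
wait "$live" 2>/dev/null
signal 0 >/dev/null
dead_rc=$?

# Case 3: empty pidfile
: > "$pidfile"
signal 0
empty_rc=$?

echo "live=$live_rc dead=$dead_rc empty=$empty_rc"
rm -f "$pidfile"
```

Signal 0 is used so that nothing is actually delivered to the process. Barring PID reuse, the harness should report 0 for the live case and 1 for the other two.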

--
Shane
