Ville Mattila
2009-Sep-14 14:25 UTC
[nsd-users] NSD as slave leaves a child process behind
Hi, We've run into problems with NSD v3.2.3 (on Red Hate Enterprise Linux 5.4 x86_64) failing to kill one of it's children processes while NSD is reloading after it has received an update to a zone from master. Everything seems to running fine, but 'nsdc stop' and 'nsdc patch' etc. really don't work because the reload updates NSD the pid file but the child that was left behind really is handling all the queries and zone updates. Our NSD is a slave to some thousands of zones and because of the way we automatically create NSD configuration (we configure NSD to use every other NS of the zone as master even though typically only one of the other NS servers of a zone actually allows an AXFR for us), our NSD typically has n*10 zone transfers in SYN_SENT state destined to fail after connect timeout and for a lot of AXFR requests our NSD gets REFUSED response. I'd appreciate help hunting down the cause for this problem; we need to get 'nsdc stop && nsdc start' working to be able to restart NSD reliably after generating new nsd.conf automatically. (Of course I could write a script which kills NSD some other way instead of 'nsdc stop' but I'd prefer fixing NSD/nsdc/our environment.) ----- Here's how the problem evolves on our systems: 1. Initially (after e.g. server reboot) NSD is running fine. Three nsd processes (with default server-count setting), one of which (pid 20925 here) is the parent to the other two (20933 and 20934). The parent NSD owns all tcp LISTEN and udp sockets, and child 20933 is the only one trying to do some AXFRs from masters: % fgrep 'nsd[20925]' /var/log/messages Sep 14 14:59:50 isar nsd[20925]: nsd started (NSD 3.2.3), pid 20925 % ps -ef | grep nsd.conf nsd 20925 1 56 14:59 ? 00:00:04 /usr/sbin/nsd -c /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf nsd 20933 20925 11 14:59 ? 00:00:00 /usr/sbin/nsd -c /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf nsd 20934 20925 1 14:59 ? 00:00:00 /usr/sbin/nsd -c /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf % sudo netstat -anp | grep nsd | grep -v -E '(LISTEN|ESTABLISHED)' tcp 0 0 128.214.248.137:53 0.0.0.0:* LISTEN 20925/nsd tcp 0 0 193.167.245.84:53 0.0.0.0:* LISTEN 20925/nsd tcp 0 0 195.148.12.100:53 0.0.0.0:* LISTEN 20925/nsd tcp 0 0 2001:708:10:70::55:1:53 :::* LISTEN 20925/nsd tcp 0 0 2001:708:10:55::53:53 :::* LISTEN 20925/nsd udp 0 0 128.214.248.137:53 0.0.0.0:* 20925/nsd udp 0 0 193.167.245.84:53 0.0.0.0:* 20925/nsd udp 0 0 195.148.12.100:53 0.0.0.0:* 20925/nsd udp 0 0 2001:708:10:70::55:1:53 :::* 20925/nsd udp 0 0 2001:708:10:55::53:53 :::* 20925/nsd % sudo netstat -anp -A inet,inet6 | grep nsd | grep SYN_SENT (master server IP addresses obfuscated as masterX) tcp 0 1 128.214.248.137:57130 masterA:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:53660 masterA:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:37449 masterB:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:50400 masterC:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:48250 masterD:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:35477 masterE:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:34076 masterE:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:48529 masterE:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:48535 masterE:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:35901 masterD:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:37632 masterF:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:47745 masterE:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:32931 masterF:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:49364 masterG:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:37719 masterD:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:36117 masterF:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:37209 masterD:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:45596 masterG:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:60195 masterD:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:54687 masterE:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:54131 masterF:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:60484 masterE:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:55516 masterD:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:50900 masterF:53 SYN_SENT 20933/nsd tcp 0 1 128.214.248.137:36627 masterG:53 SYN_SENT 20933/nsd 2. Soon NSD receives an update to one of the zones and reloads, and updates the pid file: (from /var/log/messages:) Sep 14 15:01:55 isar nsd[20925]: signal received, reloading... Sep 14 15:01:55 isar nsd[21072]: memory recyclebin holds 535424 bytes % cat $(nsd-checkconf -o pidfile /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf) 21072 3. Now observe the child pid 20933 still is running directly under init process (pid 1). The child 20933 now owns the LISTEN sockets. % ps -ef | grep nsd.conf nsd 20933 1 0 14:59 ? 00:00:01 /usr/sbin/nsd /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf nsd 21072 1 2 15:01 ? 00:00:00 /usr/sbin/nsd /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf nsd 21073 21072 1 15:01 ? 00:00:00 /usr/sbin/nsd /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf % sudo netstat -anp | grep nsd | grep -v -E '(LISTEN|ESTABLISHED)' tcp 0 0 128.214.248.137:53 0.0.0.0:* LISTEN 20933/nsd tcp 0 0 193.167.245.84:53 0.0.0.0:* LISTEN 20933/nsd tcp 0 0 195.148.12.100:53 0.0.0.0:* LISTEN 20933/nsd tcp 0 0 2001:708:10:70::55:1:53 :::* LISTEN 20933/nsd tcp 0 0 2001:708:10:55::53:53 :::* LISTEN 20933/nsd udp 0 0 128.214.248.137:53 0.0.0.0:* 20933/nsd udp 0 0 193.167.245.84:53 0.0.0.0:* 20933/nsd udp 0 0 195.148.12.100:53 0.0.0.0:* 20933/nsd udp 0 0 2001:708:10:70::55:1:53 :::* 20933/nsd udp 0 0 2001:708:10:55::53:53 :::* 20933/nsd Steps 2-4 repeat over and over: process 20933 keeps on running and responding to DNS queries. I'm not sure if process 20933 should terminate in reloads but for sure nsdc cannot find it because nsdc depends on pidfile being correct. ----- Regards, -- Ville Mattila, System Specialist, Funet network, CSC PO Box 405, FIN-02101 Espoo, Finland, fax +358 9 457 2302 CSC is the Finnish IT Center for Science, http://www.csc.fi/, email: ville.mattila at csc.fi
Matthijs Mekking
2009-Sep-15 10:03 UTC
[nsd-users] NSD as slave leaves a child process behind
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi Ville, It may look like a child process is not killed after a reload, but isn't the process you are referring to (20933), the xfrd process? By default, NSD has three processes: a parent process, a child process (answering queries) and a xfrd process. During a NSD reload, the xfrd process should not be killed, merely the communication channels with the parent are updated. Of course, if you run nsdc stop, no processes should be kept alive. Is this the case? What does nsdc stop output? Notice that nsdc stop && nsdc start differs from nsdc restart. The latter attempts to start nsd after all processes are shut down. Best regards, Matthijs Mekking NLnet Labs Ville Mattila wrote:> Hi, > > We've run into problems with NSD v3.2.3 (on Red Hate Enterprise Linux > 5.4 x86_64) failing to kill one of it's children processes while NSD > is reloading after it has received an update to a zone from master. > Everything seems to running fine, but 'nsdc stop' and 'nsdc patch' > etc. really don't work because the reload updates NSD the pid file but > the child that was left behind really is handling all the queries > and zone updates. > > Our NSD is a slave to some thousands of zones and because of the way we > automatically create NSD configuration (we configure NSD to use every > other NS of the zone as master even though typically only one of the other > NS servers of a zone actually allows an AXFR for us), our NSD typically > has n*10 zone transfers in SYN_SENT state destined to fail after connect > timeout and for a lot of AXFR requests our NSD gets REFUSED response. > > I'd appreciate help hunting down the cause for this problem; we need to > get 'nsdc stop && nsdc start' working to be able to restart NSD reliably > after generating new nsd.conf automatically. (Of course I could write > a script which kills NSD some other way instead of 'nsdc stop' but I'd > prefer fixing NSD/nsdc/our environment.) > > ----- > Here's how the problem evolves on our systems: > > 1. Initially (after e.g. server reboot) NSD is running fine. Three nsd > processes (with default server-count setting), one of which (pid 20925 > here) is the parent to the other two (20933 and 20934). The parent NSD > owns all tcp LISTEN and udp sockets, and child 20933 is the only one > trying to do some AXFRs from masters: > > % fgrep 'nsd[20925]' /var/log/messages > Sep 14 14:59:50 isar nsd[20925]: nsd started (NSD 3.2.3), pid 20925 > > % ps -ef | grep nsd.conf > nsd 20925 1 56 14:59 ? 00:00:04 /usr/sbin/nsd -c > /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf > nsd 20933 20925 11 14:59 ? 00:00:00 /usr/sbin/nsd -c > /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf > nsd 20934 20925 1 14:59 ? 00:00:00 /usr/sbin/nsd -c > /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf > > % sudo netstat -anp | grep nsd | grep -v -E '(LISTEN|ESTABLISHED)' > tcp 0 0 128.214.248.137:53 0.0.0.0:* LISTEN > 20925/nsd > tcp 0 0 193.167.245.84:53 0.0.0.0:* LISTEN > 20925/nsd > tcp 0 0 195.148.12.100:53 0.0.0.0:* LISTEN > 20925/nsd > tcp 0 0 2001:708:10:70::55:1:53 :::* LISTEN > 20925/nsd > tcp 0 0 2001:708:10:55::53:53 :::* LISTEN > 20925/nsd > udp 0 0 128.214.248.137:53 0.0.0.0:* > 20925/nsd > udp 0 0 193.167.245.84:53 0.0.0.0:* > 20925/nsd > udp 0 0 195.148.12.100:53 0.0.0.0:* > 20925/nsd > udp 0 0 2001:708:10:70::55:1:53 :::* > 20925/nsd > udp 0 0 2001:708:10:55::53:53 :::* > 20925/nsd > > % sudo netstat -anp -A inet,inet6 | grep nsd | grep SYN_SENT > (master server IP addresses obfuscated as masterX) > tcp 0 1 128.214.248.137:57130 masterA:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:53660 masterA:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:37449 masterB:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:50400 masterC:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:48250 masterD:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:35477 masterE:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:34076 masterE:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:48529 masterE:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:48535 masterE:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:35901 masterD:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:37632 masterF:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:47745 masterE:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:32931 masterF:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:49364 masterG:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:37719 masterD:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:36117 masterF:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:37209 masterD:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:45596 masterG:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:60195 masterD:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:54687 masterE:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:54131 masterF:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:60484 masterE:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:55516 masterD:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:50900 masterF:53 SYN_SENT > 20933/nsd > tcp 0 1 128.214.248.137:36627 masterG:53 SYN_SENT > 20933/nsd > > > 2. Soon NSD receives an update to one of the zones and reloads, and > updates the pid file: > > (from /var/log/messages:) > Sep 14 15:01:55 isar nsd[20925]: signal received, reloading... > Sep 14 15:01:55 isar nsd[21072]: memory recyclebin holds 535424 bytes > > % cat $(nsd-checkconf -o pidfile > /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf) > 21072 > > 3. Now observe the child pid 20933 still is running directly under init > process (pid 1). > The child 20933 now owns the LISTEN sockets. > > % ps -ef | grep nsd.conf > nsd 20933 1 0 14:59 ? 00:00:01 /usr/sbin/nsd > /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf > nsd 21072 1 2 15:01 ? 00:00:00 /usr/sbin/nsd > /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf > nsd 21073 21072 1 15:01 ? 00:00:00 /usr/sbin/nsd > /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf > > % sudo netstat -anp | grep nsd | grep -v -E '(LISTEN|ESTABLISHED)' > tcp 0 0 128.214.248.137:53 0.0.0.0:* LISTEN > 20933/nsd > tcp 0 0 193.167.245.84:53 0.0.0.0:* LISTEN > 20933/nsd > tcp 0 0 195.148.12.100:53 0.0.0.0:* LISTEN > 20933/nsd > tcp 0 0 2001:708:10:70::55:1:53 :::* LISTEN > 20933/nsd > tcp 0 0 2001:708:10:55::53:53 :::* LISTEN > 20933/nsd > udp 0 0 128.214.248.137:53 0.0.0.0:* > 20933/nsd > udp 0 0 193.167.245.84:53 0.0.0.0:* > 20933/nsd > udp 0 0 195.148.12.100:53 0.0.0.0:* > 20933/nsd > udp 0 0 2001:708:10:70::55:1:53 :::* 20933/nsd > udp 0 0 2001:708:10:55::53:53 :::* 20933/nsd > > Steps 2-4 repeat over and over: process 20933 keeps on running and > responding to DNS queries. > > I'm not sure if process 20933 should terminate in reloads but for sure > nsdc cannot find it because nsdc depends on pidfile being correct. > ----- > > Regards,-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iQEcBAEBAgAGBQJKr2VrAAoJEA8yVCPsQCW5/FMIAKVpwdZzN4PcHGtTGxE/oJ/J Wl68IWKdit43Taj6F5T120Zay2x/YvaFaA/yQqosHNgQ7HwuJGTcARRDY1RD2eyr rOXXLo26+istlrN0EVgIeyZshTqTlZ05lFtA/EBVpLOIsGWG2CFEF8dGPkFRDPYY BxFt6RWrfluJsuFOcMy7FS05Hczvp0iAtuoFjrJJF6R9w1RtOGakCkG2oMzePlZr ZKe2us303sb/QFNX5XF/yCyQf7DWg+fvbuZnfdGNIjnxQN4dUQdS644P6qiDGPwy +lSyzQQjl1MH3nMBmv/dTbADc9yJ1Ef2OTZV2H7xWxsVsWgiQF+VpSkztwGHN34=rWQR -----END PGP SIGNATURE-----