Hi NSD developers and users,
I've observed a situation with NSD that I think deserves some attention,
and perhaps some kind of fix.
We have a server with 32GB of RAM. When we start NSD, it loads all the
zones, and happily serves them. It uses close to 15GB of RAM. After a
while, it gets a NOTIFY for a zone, and AXFRs the zone. It saves the XFR
in /var/lib/nsd/nsd-xfr-5231. It then tries to apply the update, and
this is when it all goes wrong. NSD's method of updating is to fork
itself, have the child reload the changed zone(s), and take over from
the parent. While forking, NSD can temporarily need up to double the
amount of RAM, so here the fork fails because of memory shortage.
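To put rough numbers on this, here is a minimal sketch of the worst-case arithmetic. The function name is made up for illustration and is not part of NSD. It assumes the forked child ends up with a full private copy of the parent's memory, which is pessimistic, since fork is normally copy-on-write and pages only get duplicated as they are modified.

```python
def worst_case_fork_headroom_gib(total_ram_gib, resident_gib):
    """Return the RAM (in GiB) left over if parent and child each hold
    a full, unshared copy of the process memory.

    Assumption: nothing stays shared after the fork (worst case).
    A negative result means the worst case does not fit in RAM.
    """
    return total_ram_gib - 2 * resident_gib


if __name__ == "__main__":
    # Figures from this report: 32GB server, NSD using close to 15GB.
    headroom = worst_case_fork_headroom_gib(32, 15)
    print("worst-case headroom for everything else:", headroom, "GiB")
```

With the figures above, very little is left for the kernel, the page cache, and the other NSD processes, which is consistent with the fork being refused.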
The log shows this:
[2022-03-30 15:16:27.986] nsd[5299]: error: fork failed: Cannot allocate memory
[2022-03-30 15:16:28.355] nsd[45999]: error: handle_reload_cmd: reload closed cmd channel
[2022-03-30 15:16:28.355] nsd[45999]: warning: Reload process 5299 failed, continuing with old database
[2022-03-30 15:16:28.355] nsd[5231]: error: process 5299 exited with status 256
[2022-03-30 15:16:29.776] nsd[45999]: error: fork failed: Cannot allocate memory
[2022-03-30 15:16:30.149] nsd[46012]: error: handle_reload_cmd: reload closed cmd channel
[2022-03-30 15:16:30.149] nsd[46012]: warning: Reload process 45999 failed, continuing with old database
[2022-03-30 15:16:31.748] nsd[46013]: error: handle_reload_cmd: reload closed cmd channel
[2022-03-30 15:16:31.748] nsd[46013]: warning: Reload process 46012 failed, continuing with old database
After this, there are no more log entries about trying to reload the
database.
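Since NSD goes quiet after the last failure, it may help to scan the log for this pattern. The following is a rough sketch, not an NSD tool: the function name and regular expressions are my own, written against the message formats shown in the log lines above, and it assumes each log entry sits on a single line as it would in the actual log file.

```python
import re

# Prefix format as seen in the log lines above: [timestamp] nsd[pid]: level: message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] nsd\[(?P<pid>\d+)\]: (?P<level>\w+): (?P<msg>.*)$"
)
RELOAD_FAILED_RE = re.compile(r"Reload process (\d+) failed")


def summarize_reload_failures(lines):
    """Return (fork_failure_count, [PIDs of reload processes that failed])."""
    fork_failures = 0
    failed_reload_pids = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an NSD log line in the expected format
        msg = match.group("msg")
        if msg.startswith("fork failed"):
            fork_failures += 1
        hit = RELOAD_FAILED_RE.search(msg)
        if hit:
            failed_reload_pids.append(int(hit.group(1)))
    return fork_failures, failed_reload_pids


if __name__ == "__main__":
    sample = [
        "[2022-03-30 15:16:27.986] nsd[5299]: error: fork failed: Cannot allocate memory",
        "[2022-03-30 15:16:28.355] nsd[45999]: warning: Reload process 5299 failed, continuing with old database",
    ]
    print(summarize_reload_failures(sample))
```

A non-zero fork failure count followed by no further reload messages would be a reasonable trigger for an alert, though the exact threshold is a judgment call.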
PID 5231 is the xfrd process, and 5299 was the master that coordinates
things. Now, the situation looks like this:
# systemctl status nsd
● nsd.service - NSD DNS Server
   Loaded: loaded (/usr/lib/systemd/system/nsd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2022-01-04 12:07:30 UTC; 2 months 28 days ago
 Main PID: 5231 (nsd: xfrd)
   CGroup: /system.slice/nsd.service
           ├─ 5231 /usr/sbin/nsd -d
           ├─46013 /usr/sbin/nsd -d
           ├─46016 /usr/sbin/nsd -d
           └─46024 /usr/sbin/nsd -d
So the xfrd process is still running and keeps doing zone transfers,
which slowly accumulate in /var/lib/nsd/nsd-xfr-5231 and will
eventually fill up the disk. The child processes are also still running
and serving queries, but from zones that are now outdated, because
there is no master process left to apply the transfers. Log file
rotation is broken too: when I run "nsd-control log_reopen", no new log
file is created, so the log file will also grow unbounded until it
fills up the disk. Essentially, NSD is crippled, and only a restart
will get it out of this broken state.
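Until there is a fix, the accumulating transfers can at least be watched. Below is a minimal monitoring sketch, assuming the path from this report; the function name is hypothetical, and the PID suffix in the path appears to follow the xfrd process, so it will differ per instance. It handles the path being either a directory or a single file, since I have not verified which applies on every setup.

```python
import os


def pending_transfer_usage(xfr_path):
    """Return (file_count, total_bytes) for regular files at or under xfr_path.

    A missing path yields (0, 0).
    """
    if os.path.isfile(xfr_path):
        return 1, os.path.getsize(xfr_path)
    count = 0
    total = 0
    for root, _dirs, files in os.walk(xfr_path):
        for name in files:
            path = os.path.join(root, name)
            try:
                total += os.path.getsize(path)
                count += 1
            except OSError:
                # File vanished between listing and stat; skip it.
                continue
    return count, total


if __name__ == "__main__":
    # Path taken from this report; adjust for the running xfrd instance.
    files, size = pending_transfer_usage("/var/lib/nsd/nsd-xfr-5231")
    print(files, "pending transfer files,", size, "bytes")
```

A file count or byte total that only ever grows between checks would suggest that transfers are being received but never applied.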
The easiest way to prevent this is to add RAM to the server, but in my
opinion that is a waste of resources. It may also not be trivial to do.
It might be easy on a virtual server, but with a physical server, one
needs to buy RAM, shut down the server and add the memory modules. In
this area, I find NSD to be deficient. Other name servers handle their
memory differently, and make incremental use of memory as zones are
added.
My question for the developers: is there any way to make NSD apply
zone updates with less memory than this fork/reload approach needs?
Regards,
Anand