Klaus Darilion
2019-Apr-04 21:07 UTC
[nsd-users] questions how NSD slave handles XFR and how to improve XFR-patching performance
Hello!

I tried to analyze how NSD as a slave handles XFRs. Please correct me if I am wrong.

When NSD is running, I see 3 processes; let's call them P1, P2 and P3. It seems that P3 is the "worker", P1 is a "supervisor" and P2 is the "zone handler". Is this correct? What are the correct names you use for these individual processes?

The incoming NOTIFY is received by P3 and signaled to P1. P1 performs the XFR and saves it to disk. Then P2 calls clone() and P4 is generated. I have no idea what P4 is used for. P2 reads the xfr file (written to disk by P1) and applies the difference to the in-memory zone. P2 deletes the xfr file. P2 calls clone() and P5 is generated. P3 and P4 exit() and P5 is the new worker.

What do you call P4, and what is it used for?

I observed the above scenario on an NSD slave with a huge TLD zone A (2.7 GB) and on an NSD with a big TLD zone B (700 MB).

However, on another NSD which has both zones loaded (plus some smaller zones, ~500 MB in total) it looks different. Zone B sends NOTIFYs every 3 seconds. Hence, NSD is more or less permanently applying diffs and forking. But it behaves differently than in the above scenario: there is not only P4 and P5, but also P6.

    # pstree -p | grep nsd
    |-nsd(12456)---nsd(12457)-+-nsd(8728)
    |                         |-nsd(9393)
    |                         `-nsd(11013)

looks normal, but some seconds later:

    |-nsd(12456)---nsd(12457)-+-nsd(8728)
    |                         |-nsd(9393)
    |                         |-nsd(11013)
    |                         `-nsd(13424)

and some seconds later, two processes exit and a new one is forked:

    |-nsd(12456)---nsd(12457)-+-nsd(11013)
    |                         |-nsd(13424)
    |                         `-nsd(14970)

Is this a legal scenario?

From the logs, it seems to me that NOTIFYs arrive faster than NSD can apply the diffs and fork, so a second XFR is initiated while NSD is still waiting for the previous XFR to finish (I am not sure, but I think the clone() is the problem, as it takes 4-5 seconds).

So, how is NSD supposed to work when it receives NOTIFYs while the last XFR is not finished yet? Is it a feature or a bug to start the next XFR although the previous one is not finished yet?

With strace I sometimes see an aborted clone(). Might this be a sign of some problem? For example:

    [pid 12457] 19:56:44.043771 <... clone resumed> child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f324b19ba10) = ? ERESTARTNOINTR (To be restarted) <5.683020>

Could it be that a clone() is aborted because there is a newer zone version, so the XFR is applied and the clone() restarted? If so, a clone() might never finish as long as a new XFR keeps becoming available.

Any ideas to improve the XFR-applying speed? I think the most time-consuming part is the clone() (P2 has around 2.3 million page tables). Are there any config/build options to improve this, e.g. huge pages? Using some other forking function?

Thanks
Klaus
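The suspicion above, that clone() time grows with the amount of memory the forking process has mapped, can be checked outside NSD with a small standalone measurement. The following is a minimal sketch in Python, not NSD code: the function name `time_fork` and the buffer sizes are hypothetical, and absolute timings will depend on the kernel, the hardware and whether huge pages are in use.

```python
import os
import time


def time_fork(resident_bytes: int) -> float:
    """Return the seconds spent in fork() while holding resident_bytes of touched memory."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    buf = bytearray(resident_bytes)
    # Write one byte per page so that every page is actually mapped,
    # which forces the kernel to build page table entries for the buffer.
    for offset in range(0, resident_bytes, page_size):
        buf[offset] = 1

    start = time.perf_counter()
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running any cleanup handlers.
        os._exit(0)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)  # reap the child so no zombie is left behind
    return elapsed


if __name__ == "__main__":
    # Hypothetical sizes; scale them up to approach a multi-GB zone.
    for mib in (8, 32, 128):
        seconds = time_fork(mib * 2**20)
        print(f"{mib} MiB resident: fork() took {seconds * 1000:.2f} ms")
```

If the measured time rises roughly in proportion to the resident size, that supports the idea that copying page tables dominates the fork cost for a process of P2's size.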
Anand Buddhdev
2019-Apr-08 06:57 UTC
[nsd-users] questions how NSD slave handles XFR and how to improve XFR-patching performance
On 04/04/2019 23:07, Klaus Darilion wrote:

Good morning Klaus!

[snip]

> Any ideas to improve the XFR-applying-speed? I think the most time
> consuming part is the clone() (P2 has around 2.3million page tables).
> Are there and config/build options to improve this? Ie huge pages? Using
> some other forking function?

I can't comment on improving NSD's XFR-handling model, but you could throttle it by setting a larger value for "xfrd-reload-timeout" in nsd.conf. The default is 1 second. If you set it to something like 30 seconds, NSD will still receive and apply the incoming XFRs, but will reload only once every 30 seconds, which gives the server some breathing room.

Regards,
Anand Buddhdev
RIPE NCC
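For reference, the option mentioned above belongs in the server: section of nsd.conf. A minimal fragment, using the 30-second figure from the message as an example value rather than a tuned recommendation:

```
server:
    # Minimum number of seconds between reloads triggered by zone transfers.
    # The default is 1; 30 is only the example value from the message above.
    xfrd-reload-timeout: 30
```

As a rough illustration, assuming every NOTIFY for zone B leads to a transfer, NOTIFYs arriving every 3 seconds would mean about ten transfers are covered by a single reload (and a single fork) instead of one reload each.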