The two write()s in NSD's handle_tcp_writing() run afoul of the Nagle
algorithm (RFC 896). According to tcpdump, this delays every answer to a
query received over TCP by an extra network round-trip time.
Nagle, which is on by default for TCP connections, is a simple fix for
having too many small packets in flight: when data is written to a
connection while there is already un-ACKed data in flight to the remote
side, the new data is held back until that data is acknowledged;
otherwise it is sent immediately.
The first write(), of the two-byte size (server.c line 974), is sent
immediately in its own packet, while the second write(), of the actual
answer data (server.c line 1005), is delayed until the first packet is
ACKed.
The easiest fix for the extra latency is to disable Nagle (setsockopt
TCP_NODELAY) on every TCP connection, but that would still send an
unnecessary extra TCP packet. There are OS-specific hacks for coalescing
data across multiple write()s (such as Linux's TCP_CORK), but the
cleanest and most portable approach would probably be to get NSD to only
write() once per answer.
It looks like putting the length into the same buffer as the data about to
be sent would take care of that, but I'm not familiar enough with the code
to make that happen. Is this feasible?
-- Aaron