On 02.03.2012 02:13, Anand Buddhdev wrote:
> This question is aimed more at the NSD developers, but of course if
> anyone knows the answer, feel free to chime in.
>
> While writing some code to process DNS queries and responses over TCP,
> one of my colleagues noticed something strange about NSD's TCP
> responses. Here's what we have observed:
>
> client: syn
> server: syn + ack
> client: ack
> client: push + ack + query
> server: ack
> server: ack + 2 bytes indicating size of following dns message
> client: ack
> server: push + ack + response
>
> I'm omitting the closing sequence of FINs and ACKs here.
>
> Comparing this to a BIND server, we see:
>
> client: syn
> server: syn + ack
> client: ack
> client: push + ack + query
> server: push + ack + 2 bytes + response
>
> Notice how NSD uses an extra TCP segment to send just the 2 bytes
> indicating the length of the response packet, whereas BIND does it all
> in the same TCP segment. BIND's behaviour seems logical to me, whereas
> NSD's seems... strange.
>
> Is there any reason NSD does it this way? TCP performance isn't really
> an issue for us, so I don't see any immediate need to fix this, if
> indeed a fix is even needed. We'd just like to understand this
> difference in behaviour.
There is no strong reason why NSD _should_ do this the way BIND
does it.
TCP is a STREAM of bytes; the "packetizing" of TCP is not specified
in any standard at all. An application can use various system-
specific methods to express its preference (like TCP_CORK on Linux),
but even with these, the TCP stack is allowed to divide the
stream into packets more or less arbitrarily.
So the client should be prepared for even the worst-case scenario,
i.e., it should be able to read the whole thing one byte at a time.
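To illustrate, here is a minimal sketch of a client read path that
tolerates any segmentation. This is not code from NSD, BIND or any
actual resolver; the helper names are made up:

/* Sketch: read exactly "len" bytes from a blocking TCP socket,
 * regardless of how the peer's stack segmented the stream. */
#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>

static int read_full(int fd, void *buf, size_t len)
{
    uint8_t *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;          /* real error */
        }
        if (n == 0)
            return -1;          /* peer closed before full message */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Usage: first the 2-byte length prefix, then the DNS message.
 * This works whether the server sent them in one segment or many. */
static int read_dns_reply(int fd, uint8_t *msg, size_t msgsize)
{
    uint16_t netlen;
    if (read_full(fd, &netlen, sizeof(netlen)) < 0)
        return -1;
    uint16_t len = ntohs(netlen);
    if (len > msgsize)
        return -1;              /* reply too large for our buffer */
    if (read_full(fd, msg, len) < 0)
        return -1;
    return (int)len;
}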
The way BIND does it is merely an optimization to minimize network
round trips; again, it is in no way mandatory.
As for the optimization itself: NSD should be prepared for the write
to fail with EAGAIN at any time, which means the kernel send buffer
is full, so NSD has to repeat the write from the position where it
stopped. That is easy if we're writing just one buffer of data, but
we have two: the size and the data itself.
There are at least two ways to deal with this.
First, there is the already-mentioned TCP_CORK, which can be used on
Linux (if it isn't already). It is relatively easy: turn it on
before attempting to send the size, and turn it off when done sending
the reply.
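A rough sketch of that approach, assuming Linux and ignoring the
partial-write handling for brevity; the function and variable names
are illustrative, not NSD's actual code:

/* Linux-only sketch: cork the socket so the 2-byte length and the
 * DNS message leave in as few segments as possible. */
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_CORK */
#include <sys/socket.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <stdint.h>

static void send_reply_corked(int fd, const void *msg, uint16_t msglen)
{
    int on = 1, off = 0;
    uint16_t netlen = htons(msglen);

    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));

    /* Both writes still need the usual partial-write/EAGAIN
     * handling; omitted here to keep the sketch short. */
    (void)write(fd, &netlen, sizeof(netlen));
    (void)write(fd, msg, msglen);

    /* Uncorking flushes whatever is buffered, in one go. */
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}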
Another option is to use writev() and make it restartable from an
arbitrary position. For example, this is how it is done in qemu:
http://git.qemu.org/?p=qemu.git;a=blob;f=cutils.c;hb=HEAD
(there, see the do_sendv_recvv() function, and note that it is like
writev() but has an extra argument, "offset").
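A rough sketch of such a restartable writev(), loosely in the spirit
of qemu's do_sendv_recvv() but not copied from it; the names and the
fixed iovec limit here are just for illustration:

/* Sketch: write a 2-element iovec (length prefix + DNS message) on a
 * non-blocking socket, resuming from "offset" bytes into the logical
 * stream after an earlier short write. */
#include <errno.h>
#include <string.h>
#include <sys/uio.h>

/* Returns bytes written this call, 0 on EAGAIN or if nothing is
 * left to write, -1 on error. */
static ssize_t writev_from(int fd, const struct iovec *iov, int iovcnt,
                           size_t offset)
{
    struct iovec local[8];
    int i = 0;

    if (iovcnt > 8)
        return -1;
    memcpy(local, iov, sizeof(*iov) * (size_t)iovcnt);

    /* Skip the iovec entries that were already fully written. */
    while (i < iovcnt && offset >= local[i].iov_len) {
        offset -= local[i].iov_len;
        i++;
    }
    if (i == iovcnt)
        return 0;               /* nothing left to write */
    local[i].iov_base = (char *)local[i].iov_base + offset;
    local[i].iov_len -= offset;

    ssize_t n = writev(fd, &local[i], iovcnt - i);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return n;
}

The caller would keep a per-connection offset, add the return value
to it after each call, and retry once poll() reports the socket
writable again.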
But these are possible implementation details of an _optimization_,
not of a bugfix.
Thanks,
/mjt