I would like to discuss the following commit:
https://github.com/gsliepen/tinc/commit/4a0b9981513059755b9fd15b38fc198f46a0d6f2
("Determine peer's reflexive address and port when exchanging
keys")
This is a great feature as it basically allows peers to do UDP Hole
Punching (via MTU probes) even when both are having their source ports
rewritten by a NAT, which is extremely useful. However, I have a few
questions and concerns about the way it's currently implemented in
ANS_KEY messages:
- AFAIK, once a key is negotiated using REQ_KEY and ANS_KEY messages,
and barring any changes to the graph, it will only be negotiated again
<KeyExpire> seconds later, which by default is one hour. This means that
the UDP port information is only updated once an hour, which seems like
a very long time. There are situations in which the port could suddenly
change. For example, consider a typical NAT box that will "forget"
about
UDP sessions after some period of inactivity, and will assign a
different port when tinc starts sending packets again. Or a redundant
NAT setup that fails over, losing its state in the process. As a general
note, any solution that relies on NATs not being broken is doomed to
fail in at least some situations IMHO. My point is, tinc should be able
to react to any change in network conditions quickly and efficiently.
- Isn't there a race condition in which a node requests a key, gets it,
but the intermediate nodes didn't have time to determine the UDP address
yet? When an intermediate node finds out what the UDP address is a few
seconds later, it is already too late and it won't get any opportunity
to tell the node about the address for a full hour.
- When a node receives an ANS_KEY message with UDP address information,
it will use this information to override the node's address in its own
structures, no questions asked. Shouldn't it be smarter than that? For
example, if LocalDiscovery yields a local address, the node will use it
until it gets an ANS_KEY message for the node, in which case it will
suddenly revert back to non-local communication until LocalDiscovery
kicks in again. Shouldn't there be some logic in ans_key_h() to prevent
it from updating the node's UDP address if n->status.udp_confirmed is
already true?
- If I understand the code right, when local communication (as in
LocalDiscovery) between two nodes suddenly stop working, they will
potentially fall back to indirect communication instead of direct
non-local communication, because they forgot about the address they got
from ANS_KEY (which was overridden by LocalDiscovery). Granted, that
would be an extremely weird situation (local communication suddenly
stops working but non-local UDP hole punching is okay? seems unlikely)
which we probably don't really care about.
I am wondering if it would be cleaner to store the node's real UDP
address in edge_t::address, and then if it gets updated, an ADD_EDGE
message is automatically broadcast to inform the whole graph of the new
address, updating the edge information in each node. This would solve
the first two issues by switching from a "pull" model to a
"push" model
for propagating a node's real UDP address. It would also solve the
fourth issue by storing received UDP address information in two separate
places (edge_t::address and node_t::address, respectively). In addition,
this mechanism makes more sense (well, in my mind at least) because
edge_t::address is already used to determine UDP addresses anyway.
Furthermore, it would probably work even with old clients because if my
understanding of the code is correct, they already handle edge updates
correctly.
The only regression from the current solution that I can see is that
with the current implementation, intermediate nodes can "participate"
in
determining the node's UDP address by rewriting ANS_KEY messages as they
pass through them. This is not possible in my proposal because that
would require a node to modify an edge that it doesn't own. However, I'm
not sure that it's actually that useful considering the directly
connected nodes usually have the best information about how to reach
their neighbor anyway. I can think of one example where that isn't true,
which would look like this:
A <--> B <---(NAT)---> C <---> D
In this case, B actually holds the best information about how to join D,
not C. Assuming I'm understanding tinc's graph traversal correctly, what
will happen (if my proposal is implemented) if that A will try to reach
D *via C* using UDP address information provided by B - it won't be able
to use D's edge information because it only contains local addresses. So
instead of talking directly to D, A will indirect via C. I don't think
that's much of a concern IMHO because it is only a problem if the link A
<---> D is significantly more preferable to A <---> C <---> D,
which
seems unlikely in such a topology, and can be fixed by directly
connecting D to B.
Thoughts? I get the feeling that Guus already thought about doing it
this way and that there is some fundamental problem that would break my
nice proposal :/
--
Etienne Dechamps