Good news: I have found the root cause of the bug and have come up with a fix.
Surprisingly, this is caused by a bug in hash_function(). Because of a
small mistake in the inner loop, the function will only ever use the
first 4 bytes of data, and will never look at the remaining bytes.
Incidentally, the node UDP address cache uses this function with a key
of type sockaddr_in. This structure contains the address family and port
in the first 4 bytes, and the IP address in the remaining bytes. With
the broken hash function, this means that all addresses that share the
same address family and port will hash to the same value.
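Before the actual test output below, here is a minimal standalone sketch
of the failure mode. This is NOT tinc's actual hash_function(); it is
just an illustration of how a loop whose byte index wraps around ends up
hashing only the first 4 bytes of a sockaddr_in (family and port):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Illustration only -- not tinc's hash_function(). The byte index wraps
 * around after the first 4 bytes, so the rest of the key never
 * influences the result. */
static uint32_t broken_hash(const void *p, size_t len) {
    const uint8_t *q = p;
    uint32_t hash = 0;
    for(size_t i = 0; i < len; i++)
        hash = hash * 31 + q[i % 4]; /* BUG: should be q[i] */
    return hash;
}

int main(void) {
    struct sockaddr_in a, b;
    memset(&a, 0, sizeof a);
    memset(&b, 0, sizeof b);
    a.sin_family = b.sin_family = AF_INET;
    a.sin_port = b.sin_port = htons(656);
    a.sin_addr.s_addr = htonl(0x0AAC0101); /* 10.172.1.1 */
    b.sin_addr.s_addr = htonl(0x5E1718DF); /* 94.23.24.223 */

    /* Same family and port => same first 4 bytes => same hash,
     * even though the IP addresses differ. */
    printf("%u %u\n", (unsigned)broken_hash(&a, sizeof a),
                      (unsigned)broken_hash(&b, sizeof b));
    return 0;
}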
This is demonstrated by the following test code:
https://github.com/dechamps/tinc/compare/1828908...hashtest
which outputs the following:
Hash test: 10.172.1.1:656 (10.172.1.1 port 656) -> 144
Hash test: 10.172.1.5:656 (10.172.1.5 port 656) -> 144
Hash test: 94.23.24.223:656 (94.23.24.223 port 656) -> 144
Hash test: 127.0.0.1:656 (127.0.0.1 port 656) -> 144
Hash test: 1.2.3.4:656 (1.2.3.4 port 656) -> 144
Hash test: 10.172.1.1:655 (10.172.1.1 port 655) -> 104
Hash test: 10.172.1.5:655 (10.172.1.5 port 655) -> 104
Hash test: 94.23.24.223:655 (94.23.24.223 port 655) -> 104
Hash test: 127.0.0.1:655 (127.0.0.1 port 655) -> 104
Hash test: 1.2.3.4:655 (1.2.3.4 port 655) -> 104
Hash test: 10.172.1.1:42 (10.172.1.1 port 42) -> 80
Hash test: 10.172.1.5:42 (10.172.1.5 port 42) -> 80
Hash test: 94.23.24.223:42 (94.23.24.223 port 42) -> 80
Hash test: 127.0.0.1:42 (127.0.0.1 port 42) -> 80
Hash test: 1.2.3.4:42 (1.2.3.4 port 42) -> 80
To make things worse, tinc's hash table doesn't handle collisions at
all, meaning that two keys with the same hash value will overwrite each
other's entries.
This means that if two nodes happen to use the same port number, they
can't appear in the node UDP address cache at the same time.
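To illustrate what "doesn't handle collisions" means in practice, here
is a toy single-slot-per-bucket cache (made-up names and a 256-bucket
size taken from the debug output above; this is not tinc's actual hash
table code). A colliding insert simply evicts the previous entry:

#include <stdio.h>

#define BUCKETS 256

/* Toy cache: one slot per bucket, no chaining. Inserting a key that
 * hashes into an occupied bucket silently evicts the previous entry. */
struct slot {
    unsigned key;
    const char *value;
    int used;
};

static struct slot cache[BUCKETS];

static void cache_insert(unsigned key, const char *value) {
    struct slot *s = &cache[key % BUCKETS];
    s->key = key;      /* silently overwrites any colliding entry */
    s->value = value;
    s->used = 1;
}

static const char *cache_lookup(unsigned key) {
    const struct slot *s = &cache[key % BUCKETS];
    return (s->used && s->key == key) ? s->value : NULL;
}

int main(void) {
    /* 144 and 400 both map to bucket 144 (400 % 256 == 144), just as
     * every address with the same family and port landed in the same
     * bucket in the test output above. */
    cache_insert(144, "firefly's local address");
    cache_insert(400, "kobol's address");
    printf("%s\n", cache_lookup(144) ? cache_lookup(144) : "evicted");
    return 0;
}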
Unfortunately, two nodes sharing a port number is exactly the situation I had put myself in:
- Node "kobol" sits somewhere on the Internet, IP address 94.23.24.223
- Node "zyklos" is on my LAN, local IP address 10.172.1.1
- Node "firefly" is on my LAN, local IP address 10.172.1.5
All these nodes use port 656. Now, here's what happens when
LocalDiscovery comes into play:
1. LocalDiscovery succeeds, zyklos and firefly are communicating locally,
and their UDP address caches contain each other's local addresses:
[144 (2148532224 % 256 = 144)] 10.172.1.5 port 656 -> firefly
2. At some point, kobol will attempt to communicate with one of the two
nodes. This could happen because they need to negotiate PMTU again,
which happens every PingInterval seconds.
3. If communication over UDP succeeds, kobol's address will overwrite the
existing contents of the node's UDP address cache:
[144 (2148532224 % 256 = 144)] 94.23.24.223 port 656 -> kobol
4. As a result, when the node receives data from the other local node,
it will not find the sender's address in the UDP address cache and will
have to fall back to try_harder() to figure out which node the data came from.
5. If one of the nodes starts talking simultaneously with kobol and the
other local node (e.g. in my test, the two local nodes are pinging each
other while kobol is renegotiating PMTU at the same time), then they
will constantly overwrite each other in the cache, and try_harder() will
be called several times per second.
6. Unfortunately, there is what seems to be a "safety" (oracle
protection) measure in try_harder() that makes it refuse to "try
harder" more than once per second, dropping packets instead (a minimal
sketch of such a throttle follows this list).
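For reference, the throttle described in step 6 boils down to something
like the following (hypothetical names, not tinc's actual try_harder()
code): a check that refuses to repeat the expensive source lookup within
the same second, so the caller drops the packet instead.

#include <stdbool.h>
#include <time.h>

/* Hypothetical sketch of the once-per-second limit described in step 6;
 * not tinc's actual code. If the expensive lookup already ran during
 * the current second, the caller drops the packet. */
bool may_try_harder(void) {
    static time_t last_try;
    time_t now = time(NULL);

    if(now == last_try)
        return false;   /* already tried this second: drop the packet */

    last_try = now;
    return true;
}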
In summary: when a node is communicating with more than one other node
simultaneously, the UDP address cache is constantly being overwritten,
try_harder() is called many times per second, and as a result,
throughput is limited to at most one packet per second, which is of
course completely impractical.
Here's the fix: https://github.com/gsliepen/tinc/pull/5
On 15/07/2013 19:45, Etienne Dechamps wrote:
> Hi,
>
> I believe I have found a bug with regard to the LocalDiscovery feature.
> This is on tinc-1.1pre7 between two Windows nodes.
>
> Steps to reproduce:
> - Get two nodes talking using LocalDiscovery (e.g. put them on the same
> LAN behind a NAT with no metaconnection to each other)
> - Make one ping the other.
>
> Expected result:
> - The two nodes should ping each other without any packet loss,
> hopefully at LAN latencies.
>
> Actual result:
> - I'm experiencing packet loss every (PingInterval) seconds. Each packet
> loss "episode" lasts roughly 1 second and during that time all packets
> are lost. Apparently it happens each time the nodes are exchanging PMTU
> probes. Packet loss correlates with "Sending/Got MTU probe" messages. It
> also materializes in the form of "Received UDP packet from unknown
> source <LAN Address>" messages.
> - There seems to be some "flapping" with regard to the local host
> discovery itself, meaning that it sometimes reverts to the "normal" mode
> of communication for a brief time for no reason. This can be seen as
> elevated latencies.
>
> For aggressive PingInterval values (e.g. 3 seconds) this makes the link
> between the two nodes basically unusable with 30%+ packet loss.
>
> My hypothesis is that during PMTU discovery tinc "forgets" about the
> other node's locally discovered address, which results in packet loss
> because it doesn't recognize packets coming from the local address
> anymore and makes it revert to "classic mode" for a brief time. Then
> after a moment local discovery kicks in again and fixes the situation,
> until PMTU discovery happens again, and so on.
>
> I will continue investigating and try to come up with a fix.
--
Etienne Dechamps