Hi,

I have successfully connected a network of about 60 nodes (many of which
are virtual machines) with tinc 1.0, but I encounter a severe bug when
physical connectivity between two major locations is lost and then
reconnected. From what I gathered, many nodes attempt to connect to many
other nodes, causing 100% CPU load on all nodes and taking down the whole
network, with no node succeeding in connecting to any other node. It seems
unable to recover from this state. Luckily I can shut down and restart most
daemons with a few keystrokes, but I have to shut them all down and then
start them sequentially with a delay, or this "perfect storm" starts all
over again.

The overall configuration is switch mode, with mixed IPv4 and IPv6 host
addressing. Otherwise the config is empty, apart from these tweaks added in
an attempt to mitigate the issue (with no success):

PingTimeout=15
UDPRcvBuf=8388608
UDPSndBuf=8388608
ProcessPriority=high

The daemon was upgraded to a vanilla-built 1.0.26 on all but two nodes
before the most recent event. The host OS is Debian based, ranging from
Squeeze to Jessie, plus a few Ubuntu Trusty, all with their respective
stock kernels.

Also, I have tried firewalling the incoming UDP traffic on most nodes,
forcing TCP for those connections, to narrow down the problem, but it
doesn't seem to change anything.

At event time, the logs have these:

tincd[1093]: Flushing meta data to server1084 (x.x.x.x port y) failed: Connection reset by peer
tincd[1093]: Flushing meta data to server1070 (x.x.x.x port y) failed: Connection reset by peer
tincd[1093]: Flushing meta data to server1052 (x.x.x.x port y) failed: Connection reset by peer
tincd[1093]: Flushing meta data to server1071 (x.x.x.x port y) failed: Connection reset by peer

And these:

tincd[1093]: Metadata socket read error for server1076 (x:x:x:x:x:x:x:x port y): Connection reset by peer
tincd[1093]: Metadata socket read error for <unknown> (x.x.x.x port y): Connection reset by peer

And occasionally:

tincd[8520]: Old connection_t for server1039 (x.x.x.x port y) status 0010 still lingering, deleting...

Any ideas?

Regards,

Pierre Beck
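P.S. For reference, each node's tinc.conf currently looks roughly like the
sketch below; the netname, node names and ConnectTo targets are placeholders,
and the real files contain many more ConnectTo lines:

    # /etc/tinc/<netname>/tinc.conf (sketch, names are placeholders)
    Name = server1073
    Mode = switch
    PingTimeout = 15
    UDPRcvBuf = 8388608
    UDPSndBuf = 8388608
    ProcessPriority = high
    ConnectTo = server1001
    ConnectTo = server1002
    # ... roughly 40 ConnectTo lines in total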
On Wed, Dec 30, 2015 at 05:26:38PM +0000, Pierre Beck wrote:

> I have successfully connected a network of about 60 nodes (many of which
> are virtual machines) with tinc 1.0, but I encounter a severe bug when
> physical connectivity between two major locations is lost and then
> reconnected. From what I gathered, many nodes attempt to connect to many
> other nodes, causing 100% CPU load on all nodes and taking down the whole
> network, with no node succeeding in connecting to any other node. It
> seems unable to recover from this state. Luckily I can shut down and
> restart most daemons with a few keystrokes, but I have to shut them all
> down and then start them sequentially with a delay, or this "perfect
> storm" starts all over again.

60 nodes is not a large number for tinc; it should normally handle this
without problems (even on underpowered hardware, like routers running
OpenWRT).

> The overall configuration is switch mode, with mixed IPv4 and IPv6 host
> addressing. Otherwise the config is empty, apart from these tweaks added
> in an attempt to mitigate the issue (with no success):
>
> PingTimeout=15
> UDPRcvBuf=8388608
> UDPSndBuf=8388608
> ProcessPriority=high

Apart from PingTimeout, these options will not help. Increasing PingTimeout
may indeed prevent tinc from prematurely closing connections in case of
congestion.

> Also, I have tried firewalling the incoming UDP traffic on most nodes,
> forcing TCP for those connections, to narrow down the problem, but it
> doesn't seem to change anything.

If anything, that is counterproductive.

> At event time, the logs have these:
>
> tincd[1093]: Flushing meta data to server1084 (x.x.x.x port y) failed: Connection reset by peer
> tincd[1093]: Flushing meta data to server1070 (x.x.x.x port y) failed: Connection reset by peer
> tincd[1093]: Flushing meta data to server1052 (x.x.x.x port y) failed: Connection reset by peer
> tincd[1093]: Flushing meta data to server1071 (x.x.x.x port y) failed: Connection reset by peer
>
> And these:
>
> tincd[1093]: Metadata socket read error for server1076 (x:x:x:x:x:x:x:x port y): Connection reset by peer
> tincd[1093]: Metadata socket read error for <unknown> (x.x.x.x port y): Connection reset by peer

That means the peer decided to close the connection, not the local node, so
the logs on those peers might provide more information about why they closed
the connection.

In any case, what does your topology of meta connections look like (i.e.,
the ones you specify using ConnectTo)? If, on each node, you ConnectTo all
other nodes, that will cause tinc to generate a lot of metadata. However,
you don't need to do that; a few ConnectTo statements are usually enough.
If you have a few central nodes to which all other nodes ConnectTo, that
should work fine as well.

-- 
Met vriendelijke groet / with kind regards,
     Guus Sliepen <guus at tinc-vpn.org>
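P.S. As a rough sketch of what I mean (node names below are just examples),
a non-central node would only need a tinc.conf like:

    Name = server1070
    Mode = switch
    ConnectTo = hub1
    ConnectTo = hub2
    ConnectTo = hub3

with an Address line in hosts/hub1 and so on, so the node knows where to
reach the central nodes:

    # hosts/hub1 (public key omitted)
    Address = hub1.example.org

The ConnectTo lines only determine the graph of meta connections; traffic
between other pairs of nodes is still exchanged over the VPN.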
Hi,

On 31.12.2015 16:01, Guus Sliepen wrote:

> If, on each node, you ConnectTo all other nodes, that will cause tinc to
> generate a lot of metadata. However, you don't need to do that; a few
> ConnectTo statements are usually enough. If you have a few central nodes
> to which all other nodes ConnectTo, that should work fine as well.

~40 ConnectTo lines. When I reduce the ConnectTo lines to, say, 5 nodes,
will tinc still use the Address= lines to form a full mesh? So when A & B
ConnectTo C, will data A sends to B still go direct as long as A or B has
an Address= line? Either way, it shouldn't crash and burn like that :-)

Topology is roughly like this:

stack of physical servers (tincd, tincd, tincd, ...)
    -> virtual servers (more tincd, tincd, tincd, ...)
    internet Uplink A, NAT for IPv4, some static IPv6

stack of physical servers (tincd, tincd, tincd, ...)
    -> virtual servers (more tincd, tincd, tincd, ...)
    internet Uplink B, IPv4-only location, NAT for some IPv4s

satellite root servers (tincd)
    -> virtual servers (tincd, tincd, tincd, ...)
    internet Uplinks C, D, E, ... again some NAT, some not

Now imagine Uplink A failing for some time and then recovering: many tincds
trying to ConnectTo many other tincds, and the VPN is dead.

As for logs, I have also found some of these:

tinc.vpn-13[4578]: Old connection_t for server1053 (x.x.x.x port y) status 0010 still lingering, deleting...

But the crash starts out with connection resets, like this pair between two
nodes:

server1070 (virtual server on Uplink B):
Dec 30 10:14:52 xxx tinc.vpn-13[4578]: Metadata socket read error for server1073 (x.x.x.x port y): Connection reset by peer

server1073 (physical server on Uplink A):
Dec 30 10:17:13 xxx tincd[1124]: Flushing meta data to server1070 (x.x.x.x port y) failed: Connection reset by peer

From that point on, it is almost exclusively the latter random connection
resets on all nodes, with some "old connection_t" messages, until the
daemons are stopped and restarted.

I will try reducing the ConnectTo lines to a sane set of highly available,
well-connected physical servers.

Happy new year and thanks for the hint,

Pierre Beck
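P.S. To make the plan concrete: each tinc.conf would keep only a handful of
ConnectTo lines pointing at well-connected physical servers, something like
(names below are just examples):

    ConnectTo = physical1
    ConnectTo = physical2
    ConnectTo = physical3

while the host files of publicly reachable nodes keep their Address lines,
e.g.:

    # hosts/server1073 (public key omitted)
    Address = uplink-a.example.net

The hope being that traffic between A and B still goes direct thanks to the
Address information, even when both only ConnectTo C.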