Hi,

I have successfully connected a network of about 60 nodes (many of which
are virtual machines) with tinc 1.0, but I encounter a severe bug when
physical connectivity between two major locations is lost and then
reconnected. From what I gathered, many nodes attempt to connect to many
other nodes, causing 100% CPU load on all nodes and taking down the whole
network, with no node succeeding in connecting to any other node. It seems
unable to recover from this state. Luckily I can shut down and restart most
daemons with a few keystrokes, but I have to shut them all down and then
start them sequentially with a delay, or this "perfect storm" starts all
over again.

The overall configuration is switch mode, with mixed IPv4 and IPv6 host
addressing. Otherwise the config is empty, apart from these tweaks added in
an attempt to mitigate the issue (with no success):

PingTimeout=15
UDPRcvBuf=8388608
UDPSndBuf=8388608
ProcessPriority=high

The daemon was upgraded to a vanilla-built 1.0.26 on all but two nodes
before the most recent event. The host OS is Debian based, ranging from
Squeeze to Jessie, plus a few Ubuntu Trusty, all with their respective
stock kernels.

Also, I have tried firewalling the incoming UDP traffic on most nodes,
forcing TCP for those connections, to narrow down the problem, but it
doesn't seem to change anything.

At event time, the logs have these:

tincd[1093]: Flushing meta data to server1084 (x.x.x.x port y) failed: Connection reset by peer
tincd[1093]: Flushing meta data to server1070 (x.x.x.x port y) failed: Connection reset by peer
tincd[1093]: Flushing meta data to server1052 (x.x.x.x port y) failed: Connection reset by peer
tincd[1093]: Flushing meta data to server1071 (x.x.x.x port y) failed: Connection reset by peer

And these:

tincd[1093]: Metadata socket read error for server1076 (x:x:x:x:x:x:x:x port y): Connection reset by peer
tincd[1093]: Metadata socket read error for <unknown> (x.x.x.x port y): Connection reset by peer

And occasionally:

tincd[8520]: Old connection_t for server1039 (x.x.x.x port y) status 0010 still lingering, deleting...

Any ideas?

Regards,

Pierre Beck
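P.S. For reference, each node's tinc.conf currently looks roughly like the
sketch below; the netname, node names and ConnectTo targets are placeholders,
and the real files contain many more ConnectTo lines:

    # /etc/tinc/<netname>/tinc.conf (sketch, names are placeholders)
    Name = server1073
    Mode = switch
    PingTimeout = 15
    UDPRcvBuf = 8388608
    UDPSndBuf = 8388608
    ProcessPriority = high
    ConnectTo = server1001
    ConnectTo = server1002
    # ... roughly 40 ConnectTo lines in total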
On Wed, Dec 30, 2015 at 05:26:38PM +0000, Pierre Beck wrote:

> I have successfully connected a network of about 60 nodes (many of which
> are virtual machines) with tinc 1.0, but I encounter a severe bug when
> physical connectivity between two major locations is lost and then
> reconnected. From what I gathered, many nodes attempt to connect to many
> other nodes, causing 100% CPU load on all nodes and taking down the whole
> network, with no node succeeding in connecting to any other node. It
> seems unable to recover from this state. Luckily I can shut down and
> restart most daemons with a few keystrokes, but I have to shut them all
> down and then start them sequentially with a delay, or this "perfect
> storm" starts all over again.

60 nodes is not a large number for tinc; it should normally handle this
without problems (even on underpowered hardware, like routers running
OpenWRT).

> The overall configuration is switch mode, with mixed IPv4 and IPv6 host
> addressing. Otherwise the config is empty, apart from these tweaks added
> in an attempt to mitigate the issue (with no success):
>
> PingTimeout=15
> UDPRcvBuf=8388608
> UDPSndBuf=8388608
> ProcessPriority=high

Apart from PingTimeout, these options will not help. Increasing PingTimeout
may indeed prevent tinc from prematurely closing connections in case of
congestion.

> Also, I have tried firewalling the incoming UDP traffic on most nodes,
> forcing TCP for those connections, to narrow down the problem, but it
> doesn't seem to change anything.

If anything, that is counterproductive.

> At event time, the logs have these:
>
> tincd[1093]: Flushing meta data to server1084 (x.x.x.x port y) failed: Connection reset by peer
> tincd[1093]: Flushing meta data to server1070 (x.x.x.x port y) failed: Connection reset by peer
> tincd[1093]: Flushing meta data to server1052 (x.x.x.x port y) failed: Connection reset by peer
> tincd[1093]: Flushing meta data to server1071 (x.x.x.x port y) failed: Connection reset by peer
>
> And these:
>
> tincd[1093]: Metadata socket read error for server1076 (x:x:x:x:x:x:x:x port y): Connection reset by peer
> tincd[1093]: Metadata socket read error for <unknown> (x.x.x.x port y): Connection reset by peer

That means the peer decided to close the connection, not the local node, so
the logs on those peers might provide more information about why they closed
the connection.

In any case, what does your topology of meta connections look like (i.e.,
the ones you specify using ConnectTo)? If, on each node, you ConnectTo all
other nodes, that will cause tinc to generate a lot of metadata. However,
you don't need to do that; a few ConnectTo statements are usually enough.
If you have a few central nodes to which all other nodes ConnectTo, that
should work fine as well.

-- 
Met vriendelijke groet / with kind regards,
     Guus Sliepen <guus at tinc-vpn.org>
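P.S. As a rough sketch of what I mean (node names below are just examples),
a non-central node would only need a tinc.conf like:

    Name = server1070
    Mode = switch
    ConnectTo = hub1
    ConnectTo = hub2
    ConnectTo = hub3

with an Address line in hosts/hub1 and so on, so the node knows where to
reach the central nodes:

    # hosts/hub1 (public key omitted)
    Address = hub1.example.org

The ConnectTo lines only determine the graph of meta connections; traffic
between other pairs of nodes is still exchanged over the VPN.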
Hi,

On 31.12.2015 16:01, Guus Sliepen wrote:

> If, on each node, you ConnectTo all other nodes, that will cause tinc to
> generate a lot of metadata. However, you don't need to do that; a few
> ConnectTo statements are usually enough. If you have a few central nodes
> to which all other nodes ConnectTo, that should work fine as well.

~40 ConnectTo lines. When I reduce the ConnectTo lines to, say, 5 nodes,
will tinc still use the Address= lines to form a full mesh? So when A & B
ConnectTo C, will data A sends to B still go direct as long as A or B has
an Address= line? Either way, it shouldn't crash and burn like that :-)

Topology is roughly like this:

stack of physical servers (tincd, tincd, tincd, ...)
    -> virtual servers (more tincd, tincd, tincd, ...)
    internet Uplink A, NAT for IPv4, some static IPv6

stack of physical servers (tincd, tincd, tincd, ...)
    -> virtual servers (more tincd, tincd, tincd, ...)
    internet Uplink B, IPv4-only location, NAT for some IPv4s

satellite root servers (tincd)
    -> virtual servers (tincd, tincd, tincd, ...)
    internet Uplinks C, D, E, ... again some NAT, some not

Now imagine Uplink A failing for some time and then recovering: many tincds
trying to ConnectTo many other tincds, and the VPN is dead.

As for logs, I have also found some of these:

tinc.vpn-13[4578]: Old connection_t for server1053 (x.x.x.x port y) status 0010 still lingering, deleting...

But the crash starts out with connection resets, like this pair between two
nodes:

server1070 (virtual server on Uplink B):
Dec 30 10:14:52 xxx tinc.vpn-13[4578]: Metadata socket read error for server1073 (x.x.x.x port y): Connection reset by peer

server1073 (physical server on Uplink A):
Dec 30 10:17:13 xxx tincd[1124]: Flushing meta data to server1070 (x.x.x.x port y) failed: Connection reset by peer

From that point on, it is almost exclusively the latter random connection
resets on all nodes, with some "old connection_t" messages, until the
daemons are stopped and restarted.

I will try reducing the ConnectTo lines to a sane set of highly available,
well-connected physical servers.

Happy new year and thanks for the hint,

Pierre Beck
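P.S. To make the plan concrete: each tinc.conf would keep only a handful of
ConnectTo lines pointing at well-connected physical servers, something like
(names below are just examples):

    ConnectTo = physical1
    ConnectTo = physical2
    ConnectTo = physical3

while the host files of publicly reachable nodes keep their Address lines,
e.g.:

    # hosts/server1073 (public key omitted)
    Address = uplink-a.example.net

The hope being that traffic between A and B still goes direct thanks to the
Address information, even when both only ConnectTo C.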