Hi,

we use a tinc network of about 400 nodes, all of them Linux servers, partly in different datacenters (but generally low latency). Usually this works very well (for weeks without a problem).

From time to time the whole network goes down though. This happened when we restarted a larger number of servers or when there was a connectivity issue between datacenters or some (short) maintenance on the network infrastructure. The problem was already described in the mailing list (for example here: https://www.tinc-vpn.org/pipermail/tinc/2015-December/004325.html , we see the same messages in our logs as described there).

We try to avoid situations where a large number of servers becomes unavailable, but from time to time it just happens. For us it would be important that tinc continues working with the hosts that are still reachable and that it recovers by itself, so that we do not have to stop and start the whole network manually.

We already tried to tweak the configuration to limit the amount of metadata by only having 3 ConnectTo hosts (the same ones everywhere) and using

Broadcast = no
DirectOnly = yes
Cipher=aes-128-cbc

(Apart from Name, AddressFamily, BindToAddress, Interface and ConnectTo, those are the only settings we use in tinc.conf.)

We are also going to increase PingTimeout to 30 and reduce the number of ConnectTo hosts to 2.

Is there anything else we can do to limit the amount of metadata (as that seems to be the reason why tinc just stops working and only produces log messages about failed connection attempts)?

Ideally we would not need any metadata updates at all (apart from key updates) since each host can connect to every other host and all the host config files are available everywhere locally.

We also thought about using TunnelServer = yes, would this help?

Does it make sense to somehow group ConnectTo hosts (so use two ConnectTo servers for one host group, another two for another host group and let the ConnectTo servers connect to each other)?

Thank you for any help with this!

Hendrik

On Tue, Jun 21, 2016 at 01:04:31PM +0200, Hendrik Schumacher wrote:

> From time to time the whole network goes down though. This happened when we
> restarted a larger number of servers or when there was a connectivity issue
> between datacenters or some (short) maintenance on the network
> infrastructure. The problem was already described in the mailing list (for
> example here:
> https://www.tinc-vpn.org/pipermail/tinc/2015-December/004325.html , we see
> the same messages in our logs as described there).

[...]

> We already tried to tweak the configuration to limit the amount of metadata
> by only having 3 ConnectTo hosts (the same ones everywhere) and using
>
> Broadcast = no
> DirectOnly = yes
> Cipher=aes-128-cbc

These options do not directly affect metadata. In particular, "DirectOnly = yes" may actually cause nodes to be less reachable than without that option.

> We are also going to increase PingTimeout to 30 and reduce the number of
> ConnectTo hosts to 2.

Increasing PingTimeout will probably help. As for the ConnectTo hosts: reducing the number will also reduce the amount of metadata traffic proportionally. However, in your case, with 400 nodes connecting to the same 3 central nodes, you might have to look at the amount of metadata that each node handles. It would be better to have more central nodes, and have leaf nodes only connect to a few of them.

> Is there anything else we can do to limit the amount of metadata (as that
> seems to be the reason why tinc just stops working and only produces log
> messages about failed connection attempts)?

Not really.

> Ideally we would not need any metadata updates at all (apart from key
> updates) since each host can connect to every other host and all the host
> config files are available everywhere locally.

In that case, you might want to have a look at the tinc 1.1 prerelease, remove the ConnectTo's and enable the AutoConnect feature. This will let tinc make metaconnections automatically in a more distributed way.
It will also switch metaconnections to different nodes in case the ones it is connecting to fail.

> We also thought about using TunnelServer = yes, would this help?

That might help, but then you lose most of the peer-to-peer connectivity. The reason is that tinc nodes actually only look at their host config files when making metaconnections. With TunnelServer = yes, nodes will only learn about those servers, but don't learn about all the other nodes, and then they will act like they don't exist.

> Does it make sense to somehow group ConnectTo hosts (so use two
> ConnectTo servers for one host group, another two for another host
> group and let the ConnectTo servers connect to each other)?

That will probably help.

-- 
Met vriendelijke groet / with kind regards,
Guus Sliepen <guus at tinc-vpn.org>

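To make the grouping idea above more concrete, here is a minimal sketch that compares how many leaf nodes point at each central node in the two layouts. It only counts configured ConnectTo targets, which is a rough proxy and not a measurement of actual metadata traffic. The node and hub names, the number of groups (4) and the round-robin assignment are assumptions made for illustration; only the figures of 400 nodes and 3 shared ConnectTo hosts come from the thread.

```python
from collections import Counter


def assign_hubs(nodes, hub_groups):
    """Map each leaf node to the hub list of one group, round-robin."""
    return {node: hub_groups[i % len(hub_groups)] for i, node in enumerate(nodes)}


def inbound_per_hub(assignment):
    """Count how many leaf nodes list each hub as a ConnectTo target."""
    counts = Counter()
    for hubs in assignment.values():
        counts.update(hubs)
    return counts


def render_tinc_conf(name, hubs):
    """Render only the Name and ConnectTo lines of a leaf node's tinc.conf."""
    lines = [f"Name = {name}"] + [f"ConnectTo = {hub}" for hub in hubs]
    return "\n".join(lines) + "\n"


# Hypothetical leaf names; for simplicity all 400 nodes are treated as leaves.
leaves = [f"node{i:03d}" for i in range(400)]

# Layout described in the thread: every leaf connects to the same 3 hubs.
shared = assign_hubs(leaves, [["hub1", "hub2", "hub3"]])

# Grouped layout: 4 groups (an arbitrary choice), each with its own 2 hubs.
groups = [[f"hub{2 * g + 1}", f"hub{2 * g + 2}"] for g in range(4)]
grouped = assign_hubs(leaves, groups)

print(max(inbound_per_hub(shared).values()))   # prints 400
print(max(inbound_per_hub(grouped).values()))  # prints 100
print(render_tinc_conf("node005", grouped["node005"]))
```

The sketch leaves out the ConnectTo lines between the hubs themselves, which the grouped layout still needs so that the groups form one network, as well as all other settings in tinc.conf.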
Thank you for the helpful advice. We will try to group the servers with different ConnectTo servers first. If this does not help we will look at the TunnelServer solution.

Just to make sure we understand TunnelServer correctly: do you need to specify every host as ConnectTo that the host should be able to communicate with or is it sufficient to just provide the hosts files?

Thanks,
Hendrik

2016-06-21 14:35 GMT+02:00 Guus Sliepen <guus at tinc-vpn.org>:

> [...]
> > We also thought about using TunnelServer = yes, would this help?
>
> That might help, but then you lose most of the peer-to-peer
> connectivity. The reason is that tinc nodes actually only look at their
> host config files when making metaconnections. With TunnelServer = yes,
> nodes will only learn about those servers, but don't learn about all the
> other nodes, and then they will act like they don't exist.
> [...]