Hello,

We're suffering from sporadic network blockage (read: unable to ping other
nodes) with 1.1-pre17. Before upgrading to the 1.1-pre release, the same
network blockage also manifested itself in a pure 1.0.33 network.

The log shows a lot of "Got ADD_EDGE from nodeX (192.168.0.1 port 655)
which does not match existing entry" messages, and it turns out that the
mismatches were caused by different weights received by add_edge_h().

The network consists of ~4 hub nodes and 50+ leaf nodes. Sample hub config:

Name = hub1
ConnectTo = hub2
ConnectTo = hub3
ConnectTo = hub4

A leaf looks like:

Name = node1
ConnectTo = hub1
ConnectTo = hub2
ConnectTo = hub3
ConnectTo = hub4

Back in the days of pure 1.0.33 nodes, if the network suddenly failed
(users would see tincd CPU usage go above 50% and get no ping response
from the other nodes), we could simply shut down the hub nodes, wait a few
minutes, and then restart them to get the network back to normal. However,
the 1.1-pre release seems to autoconnect to non-hub hosts based on the
information found in /etc/tinc/hosts, which means that the hub-restarting
trick won't work. Additionally, apart from the high CPU usage, 1.1-pre
tincd also starts hogging memory until the Linux OOM killer kills the
process (a memory leak, perhaps?).

Given that many of our leaf nodes are behind NAT, so there's no direct
connection to them except the tinc tunnel, is there any way to bring the
network back to a working state without shutting down all nodes? Moreover,
is there a better way to pinpoint the offending nodes that introduce this
symptom?

Thanks,
A.
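P.S. In case it helps with the pinpointing question, this is roughly how
we tally which nodes the mismatch messages name (a rough sketch only; it
assumes tincd logs via syslog to /var/log/syslog, and the sed pattern just
pulls out the node name that follows "from" in the log line quoted above):

    grep 'Got ADD_EDGE from' /var/log/syslog \
      | sed 's/.*Got ADD_EDGE from \([^ ]*\).*/\1/' \
      | sort | uniq -c | sort -rn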
We have the same network topology, with 6 hub nodes and 30+ leaf nodes. We
are also suffering from periodic network blockage and very high network
usage of ~300 MB per node per day (50 MB of upload and 250 MB of download),
even though the network is idle. Our nodes run version 1.0.34.

On Tue, Dec 11, 2018 at 12:52 PM Amit Lianson <lthmod at gmail.com> wrote:
> [...]

--
Vitaly Gorodetsky
Infrastructure Lead
Mobile: +972-52-6420530
vgorodetsky at augury.com
39 Haatzmaut St., 1st Floor, Haifa, 3303320, Israel
On Tue, Dec 11, 2018 at 02:36:18PM +0800, Amit Lianson wrote:

> We're suffering from sporadic network blockage (read: unable to ping
> other nodes) with 1.1-pre17. Before upgrading to the 1.1-pre release,
> the same network blockage also manifested itself in a pure 1.0.33
> network.
>
> The log shows a lot of "Got ADD_EDGE from nodeX (192.168.0.1 port 655)
> which does not match existing entry" messages, and it turns out that
> the mismatches were caused by different weights received by
> add_edge_h().
>
> The network consists of ~4 hub nodes and 50+ leaf nodes. Sample hub
> config:
[...]

Could you send me the output of "tincctl -n <netname> dump graph"? That
would help me try to reproduce the problem. Also, when the issue occurs,
could you run "tincctl -n <netname> log 5 >logfile" on the node that gives
those "Got ADD_EDGE which does not match existing entry" messages, let it
run for a few seconds before stopping the logging, and send me the
resulting logfile?

> Back in the days of pure 1.0.33 nodes, if the network suddenly failed
> [...] the 1.1-pre release seems to autoconnect to non-hub hosts based
> on the information found in /etc/tinc/hosts, which means that the
> hub-restarting trick won't work. Additionally, apart from the high CPU
> usage, 1.1-pre tincd also starts hogging memory until the Linux OOM
> killer kills the process (a memory leak, perhaps?).

You can disable the autoconnect feature by adding "AutoConnect = no" to
tinc.conf; unfortunately, you'd have to do that on all nodes, and it
doesn't solve the actual problem. If it's hogging memory, that definitely
points to a memory leak.

> Given that many of our leaf nodes are behind NAT [...] is there a
> better way to pinpoint the offending nodes that introduce this symptom?

I hope the output from the "log 5" command will shed some more light on
the issue, as it will show which nodes the offending ADD_EDGEs belong to.

--
Met vriendelijke groet / with kind regards,
     Guus Sliepen <guus at tinc-vpn.org>
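A rough sketch of collecting both captures on an affected node (this
assumes the 1.1 tincctl is on the PATH, and that the "dump graph" output
is graphviz-compatible so it can optionally be rendered with dot; adjust
the capture window to taste):

    # capture the current graph state
    tincctl -n <netname> dump graph > graph.dot
    # optionally render it, if graphviz is installed and the dump is dot format
    dot -Tpng graph.dot -o graph.png

    # while the blockage is happening, stream a level-5 log for ~30 seconds
    tincctl -n <netname> log 5 > logfile &
    LOGPID=$!
    sleep 30
    kill $LOGPID

    # the suggested workaround, to be added to /etc/tinc/<netname>/tinc.conf
    # on every node:
    #   AutoConnect = no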
On Wed, Dec 19, 2018 at 12:05 AM Guus Sliepen <guus at tinc-vpn.org> wrote:

> On Tue, Dec 11, 2018 at 02:36:18PM +0800, Amit Lianson wrote:
> [...]
>
> Could you send me the output of "tincctl -n <netname> dump graph"? That
> would help me try to reproduce the problem. Also, when the issue occurs,
> could you run "tincctl -n <netname> log 5 >logfile" on the node that
> gives those "Got ADD_EDGE which does not match existing entry" messages,
> let it run for a few seconds before stopping the logging, and send me
> the resulting logfile?

Ouch, I've already managed to downgrade most of our nodes to the 1.0.35
release, since it 'only' hogs the CPU when something goes wrong. Is there
a 'tincctl dump graph' alternative in 1.0.35? Would a command like
'killall -INT tincd; killall -USR2 tincd' provide enough debugging
information?

> You can disable the autoconnect feature by adding "AutoConnect = no" to
> tinc.conf; unfortunately, you'd have to do that on all nodes, and it
> doesn't solve the actual problem. If it's hogging memory, that
> definitely points to a memory leak.

Thanks for the hints. BTW, it turns out that the 1.1-pre release
automatically keeps a copy of received Ed25519 keys in
/etc/tinc/$NET/hosts. Would disabling AutoConnect also disable this
key-caching feature? I'm asking because some of our deployments have a
read-only /etc/tinc/$NET/hosts, which could be a problem if key write-back
is necessary in the 1.1-pre release.

> I hope the output from the "log 5" command will shed some more light on
> the issue, as it will show which nodes the offending ADD_EDGEs belong
> to.
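For the record, this is roughly the signal sequence I had in mind for the
1.0.35 nodes (a sketch based on my reading of the tincd(8) man page, so
please correct me if the signal semantics are off; where the output ends
up depends on the local syslog configuration):

    # temporarily raise the debug level to 5 (a second SIGINT reverts it)
    killall -INT tincd
    # dump the known nodes, edges and subnets to syslog
    killall -USR2 tincd
    sleep 60
    # revert to the previous debug level, then inspect syslog for the
    # "Got ADD_EDGE ... which does not match existing entry" lines
    killall -INT tincd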