Hello,

We're suffering from sporadic network blockage (read: unable to ping other
nodes) with 1.1-pre17. Before upgrading to the 1.1-pre release, the same
network blockage also manifested itself in a pure 1.0.33 network.

The log shows a lot of "Got ADD_EDGE from nodeX (192.168.0.1 port 655)
which does not match existing entry" messages, and it turns out that the
mismatches were caused by different weights received by add_edge_h().

The network consists of ~4 hub nodes and 50+ leaf nodes. Sample hub config:

Name = hub1
ConnectTo = hub2
ConnectTo = hub3
ConnectTo = hub4

A leaf looks like:

Name = node1
ConnectTo = hub1
ConnectTo = hub2
ConnectTo = hub3
ConnectTo = hub4

Back in the days of pure 1.0.33 nodes, if the network suddenly failed
(users would see tincd CPU usage go above 50% and get no ping response
from the other nodes), we could simply shut down the hub nodes, wait a few
minutes, and then restart them to get the network back to normal. However,
the 1.1-pre release seems to autoconnect to non-hub hosts based on the
information found in /etc/tinc/hosts, which means that the hub-restarting
trick won't work. Additionally, apart from the high CPU usage, 1.1-pre
tincd also starts hogging memory until the Linux OOM killer kills the
process (a memory leak, perhaps?).

Given that many of our leaf nodes are behind NAT, so there's no direct
connection to them except the tinc tunnel, is there any way to bring the
network back to a working state without shutting down all nodes? Moreover,
is there a better way to pinpoint the offending nodes that introduce this
symptom?

Thanks,
A.
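P.S. In case it helps with the pinpointing question, this is roughly how
we tally which nodes the mismatch messages name (a rough sketch only; it
assumes tincd logs via syslog to /var/log/syslog, and the sed pattern just
pulls out the node name that follows "from" in the log line quoted above):

    grep 'Got ADD_EDGE from' /var/log/syslog \
      | sed 's/.*Got ADD_EDGE from \([^ ]*\).*/\1/' \
      | sort | uniq -c | sort -rn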
We have the same network topology, with 6 hub nodes and 30+ leaf nodes. We
are also suffering from periodic network blockage and very high network
usage of ~300 MB per node per day (50 MB of upload and 250 MB of download),
even though the network is idle. Our nodes run version 1.0.34.

On Tue, Dec 11, 2018 at 12:52 PM Amit Lianson <lthmod at gmail.com> wrote:
> [...]

--
Vitaly Gorodetsky
Infrastructure Lead
Mobile: +972-52-6420530
vgorodetsky at augury.com
39 Haatzmaut St., 1st Floor, Haifa, 3303320, Israel
On Tue, Dec 11, 2018 at 02:36:18PM +0800, Amit Lianson wrote:

> We're suffering from sporadic network blockage (read: unable to ping
> other nodes) with 1.1-pre17. Before upgrading to the 1.1-pre release,
> the same network blockage also manifested itself in a pure 1.0.33
> network.
>
> The log shows a lot of "Got ADD_EDGE from nodeX (192.168.0.1 port 655)
> which does not match existing entry" messages, and it turns out that
> the mismatches were caused by different weights received by
> add_edge_h().
>
> The network consists of ~4 hub nodes and 50+ leaf nodes. Sample hub
> config:
[...]

Could you send me the output of "tincctl -n <netname> dump graph"? That
would help me try to reproduce the problem. Also, when the issue occurs,
could you run "tincctl -n <netname> log 5 >logfile" on the node that gives
those "Got ADD_EDGE which does not match existing entry" messages, let it
run for a few seconds before stopping the logging, and send me the
resulting logfile?

> Back in the days of pure 1.0.33 nodes, if the network suddenly failed
> [...] the 1.1-pre release seems to autoconnect to non-hub hosts based
> on the information found in /etc/tinc/hosts, which means that the
> hub-restarting trick won't work. Additionally, apart from the high CPU
> usage, 1.1-pre tincd also starts hogging memory until the Linux OOM
> killer kills the process (a memory leak, perhaps?).

You can disable the autoconnect feature by adding "AutoConnect = no" to
tinc.conf; unfortunately, you'd have to do that on all nodes, and it
doesn't solve the actual problem. If it's hogging memory, that definitely
points to a memory leak.

> Given that many of our leaf nodes are behind NAT [...] is there a
> better way to pinpoint the offending nodes that introduce this symptom?

I hope the output from the "log 5" command will shed some more light on
the issue, as it will show which nodes the offending ADD_EDGEs belong to.

--
Met vriendelijke groet / with kind regards,
     Guus Sliepen <guus at tinc-vpn.org>
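A rough sketch of collecting both captures on an affected node (this
assumes the 1.1 tincctl is on the PATH, and that the "dump graph" output
is graphviz-compatible so it can optionally be rendered with dot; adjust
the capture window to taste):

    # capture the current graph state
    tincctl -n <netname> dump graph > graph.dot
    # optionally render it, if graphviz is installed and the dump is dot format
    dot -Tpng graph.dot -o graph.png

    # while the blockage is happening, stream a level-5 log for ~30 seconds
    tincctl -n <netname> log 5 > logfile &
    LOGPID=$!
    sleep 30
    kill $LOGPID

    # the suggested workaround, to be added to /etc/tinc/<netname>/tinc.conf
    # on every node:
    #   AutoConnect = no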
On Wed, Dec 19, 2018 at 12:05 AM Guus Sliepen <guus at tinc-vpn.org> wrote:

> On Tue, Dec 11, 2018 at 02:36:18PM +0800, Amit Lianson wrote:
> [...]
>
> Could you send me the output of "tincctl -n <netname> dump graph"? That
> would help me try to reproduce the problem. Also, when the issue occurs,
> could you run "tincctl -n <netname> log 5 >logfile" on the node that
> gives those "Got ADD_EDGE which does not match existing entry" messages,
> let it run for a few seconds before stopping the logging, and send me
> the resulting logfile?

Ouch, I've already managed to downgrade most of our nodes to the 1.0.35
release, since it 'only' hogs the CPU when something goes wrong. Is there
a 'tincctl dump graph' alternative in 1.0.35? Would a command like
'killall -INT tincd; killall -USR2 tincd' provide enough debugging
information?

> You can disable the autoconnect feature by adding "AutoConnect = no" to
> tinc.conf; unfortunately, you'd have to do that on all nodes, and it
> doesn't solve the actual problem. If it's hogging memory, that
> definitely points to a memory leak.

Thanks for the hints. BTW, it turns out that the 1.1-pre release
automatically keeps a copy of received Ed25519 keys in
/etc/tinc/$NET/hosts. Would disabling AutoConnect also disable this
key-caching feature? I'm asking because some of our deployments have a
read-only /etc/tinc/$NET/hosts, which could be a problem if key write-back
is necessary in the 1.1-pre release.

> I hope the output from the "log 5" command will shed some more light on
> the issue, as it will show which nodes the offending ADD_EDGEs belong
> to.
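For the record, this is roughly the signal sequence I had in mind for the
1.0.35 nodes (a sketch based on my reading of the tincd(8) man page, so
please correct me if the signal semantics are off; where the output ends
up depends on the local syslog configuration):

    # temporarily raise the debug level to 5 (a second SIGINT reverts it)
    killall -INT tincd
    # dump the known nodes, edges and subnets to syslog
    killall -USR2 tincd
    sleep 60
    # revert to the previous debug level, then inspect syslog for the
    # "Got ADD_EDGE ... which does not match existing entry" lines
    killall -INT tincd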