Hi, thanks for getting back. I'll answer the questions, but I've already given up on tinc and switched to zerotier-one.

On 2020-07-27 5:10 p.m., borg at uu3.net wrote:
> Hi. I have a few questions out of curiosity.. Can't help for now with
> your problem...
>
> What version is crashing? 1.1 or 1.0?

1.1 is crashing.

> How is your network segmented..?
> I use tinc myself here a lot too (1.0) but my network is very segmented.
> I use switch mode and handle routing myself, so mesh links aren't large..
>
> I would NOT go beyond 30 nodes for full auto-mesh.. that's already like 435
> edges...

Well, it is not segmented. It used to be switch mode before (maybe a year back) and I used dnsmasq to do DHCP on the nodes.

However, I've since switched to router mode with static IPs, which reduced the traffic significantly and helped for a time.

I think the problem is that the edges get to 2500+, and then when the central node crashes and restarts, all the nodes try to reconnect. Once reconnected, every node sends those 2500+ edges back to the central node, which in turn tries to process them and forward them back over every connection already made. Since tinc is single-threaded, that processing starts to eat up the CPU, the nodes start to believe the connection is dead and reconnect again, which in turn restarts the whole process.

In my opinion this is a design flaw in tinc. The requirement that every node know about every other node limits how many nodes can be handled.

In my case the situation could maybe be mitigated with TunnelServer, but that leads to the crash, and furthermore it would prevent the other nodes from connecting to each other.

I think a better approach would be for the nodes to exchange information only when a link is to be established (something like ARP). For example, if node A wants to contact node C but has connections only to B and N, it asks them whether they know anything about C; if they don't, they in turn could ask their connections, and so on.

Anyway, since I've switched to zerotier I have no problems and so far it works great.

Best Regards

> Regards,
> Borg
>
> ---------- Original message ----------
>
> From: Anton Avramov <SRS0=TSOC=AB=lukav.com=lukav at mijnuvt.nl>
> To: tinc-devel at tinc-vpn.org
> Subject: SegFault when using TunnelServer=yes
> Date: Fri, 19 Jun 2020 12:22:36 -0400
>
> Hi all,
>
> I have a network with about ~800 nodes. The network is a mix of tinc 1.0 and
> 1.1 nodes. It has been gradually expanding for several years now.
>
> The problem is that at some point it seems the daemon cannot handle the
> processing of the new connections and the edges.
>
> There are 3 major nodes in the system and every other node initially
> makes a connection to one of them.
>
> Now, after a lot of debugging, I've limited all nodes to connect only
> to one node, and used iptables to admit new connections gradually. The last
> limit was 5 per minute.
>
> I've started to monitor how the edges grow on the main node, and I see
> that, although I've limited the connections on the other 2 major nodes,
> at some point there are rapid spikes in the edges when a new connection
> is established.
> So my guess is that the other nodes have a previous state of the edges
> which they try to push, and that is causing the main nodes to become
> overwhelmed.
>
> So I've decided to put TunnelServer=yes on the major nodes so they don't
> propagate the connections of the other nodes.
>
> However, I get a segfault soon after starting on each node where I enable
> that option.
>
> I've built from the latest code and here is a trace of such a run (this
> is not from a "major" node, but the effect is the same):
>
> Got ANS_KEY from Backbone (164.138.216.106 port 655): 16 Office Lukav_Beast 52201D7CFDC2C7E1FD7871A36E651B7AC24A52B4ED892CD953397F6BA859AB22D5D4CB235B9CF85910B6BDE91A34C85E 427 672 4 0 94.155.19.130 13935
> Using reflexive UDP address from Office: 94.155.19.130 port 13935
> UDP address of Office set to 94.155.19.130 port 13935
> Got REQ_KEY from Backbone (164.138.216.106 port 655): 15 Office Lukav_Beast
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x000055555556de41 in send_ans_key (to=to@entry=0x555555851060) at protocol_key.c:382
> 382             return send_request(to->nexthop->connection, "%d %s %s %s %d %d %d %d", ANS_KEY,
> (gdb) bt
> #0  0x000055555556de41 in send_ans_key (to=to@entry=0x555555851060) at protocol_key.c:382
> #1  0x000055555556e169 in req_key_h (c=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol_key.c:304
> #2  0x000055555556a083 in receive_request (c=c@entry=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol.c:146
> #3  0x000055555555e993 in receive_meta (c=c@entry=0x555555851be0) at meta.c:333
> #4  0x00005555555603f9 in handle_meta_connection_data (c=c@entry=0x555555851be0) at net.c:304
> #5  0x00005555555678c2 in handle_meta_io (data=0x555555851be0, flags=<optimized out>) at net_socket.c:520
> #6  0x000055555555c60a in event_loop () at event.c:359
> #7  0x00005555555607f2 in main_loop () at net.c:510
> #8  0x0000555555559208 in main (argc=6, argv=<optimized out>) at tincd.c:558
> (gdb) bt full
> #0  0x000055555556de41 in send_ans_key (to=to@entry=0x555555851060) at protocol_key.c:382
>         keylen = <optimized out>
>         key = "527E64B1DB47F2F527ADF7F609498FFCB4807AEC3CD49697D3D8D870619BC537E1B7C403875D81FC608A8F6E00D06063\000\306\377\377\377\177\000\000\331\334VUUU", '\000' <repeats 11 times>, "*\322\316\000\305\000\000\000\000\000\000\000\000\340\033\205UUU\000\000\001\000\000\000\000\000\000\000P\316\377\377\377\177\000\000\267K\205UUU\000\000`\020\205UUU\000\000@\306\377\377\377\177\000\000i\341VUUU\000\000\000\000\000\000\377\177\000\000\000\000\000\000\000\000\000\000"...
> #1  0x000055555556e169 in req_key_h (c=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol_key.c:304
>         from_name = "Office\000\061\071.130", '\000' <repeats 1003 times>...
>         to_name = "Lukav_Beast", '\000' <repeats 366 times>...
>         from = 0x555555851060
>         to = <optimized out>
>         reqno = 0
> #2  0x000055555556a083 in receive_request (c=c@entry=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol.c:146
>         reqno = <optimized out>
> #3  0x000055555555e993 in receive_meta (c=c@entry=0x555555851be0) at meta.c:333
>         result = <optimized out>
>         request = <optimized out>
>         inlen = 0
>         inbuf = "a\354\357\063J\363{\346d\177\271\371;+\212\371zFDt\271\061\370\ao\373\326\035\255=\254\257:\245\322\v\205\035\336?1\234\372\001\004\063\323\t\004-\b8\367\f\201\342\304g\332\361jL76C\340-\t\006\210\214\314,C\352)a\314\fAe\260\226\313\337\360|\256\236\263\344\205\061\207\303\t<\016\351\360\222\343[\317o\377\065<?b(\267\321\356\360\242p$\314`\325\001|\036\204'\\\205i\314W\356#N4\000q\320\300\344\071\060\236w\016\306[\323X]\237\321\347\177\313KU\367\b}\307\374\367\032c\036\332:\307\367\265o\307\212J\006NJ3!\305q\367\255\263\246\200i\035\327\001"...
>         bufp = 0x7fffffffd6f0 "a\354\357\063J\363{\346d\177\271\371;+\212\371zFDt\271\061\370\ao\373\326\035\255=\254\257:\245\322\v\205\035\336?1\234\372\001\004\063\323\t\004-\b8\367\f\201\342\304g\332\361jL76C\340-\t\006\210\214\314,C\352)a\314\fAe\260\226\313\337\360|\256\236\263\344\205\061\207\303\t<\016\351\360\222\343[\317o\377\065<?b(\267\321\356\360\242p$\314`\325\001|\036\204'\\\205i\314W\356#N4"
>         endp = <optimized out>
> #4  0x00005555555603f9 in handle_meta_connection_data (c=c@entry=0x555555851be0) at net.c:304
> No locals.
> #5  0x00005555555678c2 in handle_meta_io (data=0x555555851be0, flags=<optimized out>) at net_socket.c:520
>         c = 0x555555851be0
>         socket_error = <optimized out>
>         len = <optimized out>
> #6  0x000055555555c60a in event_loop () at event.c:359
>         node = 0x555555797dd8 <signalio+24>
>         next = 0x555555797dd8 <signalio+24>
> ---Type <return> to continue, or q <return> to quit---
>         io = 0x555555851d90
>         tv = <optimized out>
>         fds = <optimized out>
>         curgen = 7
>         diff = {tv_sec = 0, tv_usec = 512516}
>         n = <optimized out>
>         readable = {fds_bits = {256, 0 <repeats 15 times>}}
>         writable = {fds_bits = {0 <repeats 16 times>}}
> #7  0x00005555555607f2 in main_loop () at net.c:510
>         sighup = {signum = 1, cb = 0x555555560480 <sighup_handler>, data = 0x7fffffffe1a0, node = {next = 0x7fffffffe2a8, prev = 0x0, parent = 0x7fffffffe2a8, left = 0x0, right = 0x0, data = 0x7fffffffe1a0}}
>         sigterm = {signum = 15, cb = 0x55555555f900 <sigterm_handler>, data = 0x7fffffffe1f0, node = {next = 0x0, prev = 0x7fffffffe2f8, parent = 0x7fffffffe2f8, left = 0x0, right = 0x0, data = 0x7fffffffe1f0}}
>         sigquit = {signum = 3, cb = 0x55555555f900 <sigterm_handler>, data = 0x7fffffffe240, node = {next = 0x7fffffffe2f8, prev = 0x7fffffffe2a8, parent = 0x7fffffffe2f8, left = 0x7fffffffe2a8, right = 0x0, data = 0x7fffffffe240}}
>         sigint = {signum = 2, cb = 0x55555555f900 <sigterm_handler>, data = 0x7fffffffe290, node = {next = 0x7fffffffe258, prev = 0x7fffffffe1b8, parent = 0x7fffffffe258, left = 0x7fffffffe1b8, right = 0x0, data = 0x7fffffffe290}}
>         sigalrm = {signum = 14, cb = 0x5555555605b0 <sigalrm_handler>, data = 0x7fffffffe2e0, node = {next = 0x7fffffffe208, prev = 0x7fffffffe258, parent = 0x0, left = 0x7fffffffe258, right = 0x7fffffffe208, data = 0x7fffffffe2e0}}
> #8  0x0000555555559208 in main (argc=6, argv=<optimized out>) at tincd.c:558
>         umbstr = <optimized out>
>         priority = 0x0
>
> Any help is much appreciated since my network is unusable at the moment.
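Reading the backtrace, the faulting statement at protocol_key.c:382 dereferences to->nexthop->connection while "to" itself is valid, so the crash is most likely a NULL to->nexthop for the node the ANS_KEY should be routed back to; with TunnelServer=yes that node can be known by name but have no usable edge, so a next hop was never computed. As a purely illustrative sketch (not a reviewed fix; the logger call and return value follow tinc 1.1 conventions as far as I can tell), a guard could go in send_ans_key() right before the existing send_request() call:

    /* Hypothetical guard in send_ans_key(), protocol_key.c: assumes the
     * crash is a missing next hop under TunnelServer=yes.  Drop the
     * ANS_KEY instead of dereferencing a NULL pointer. */
    if(!to->nexthop || !to->nexthop->connection) {
            logger(DEBUG_PROTOCOL, LOG_ERR,
                   "No next hop known for %s, not sending ANS_KEY", to->name);
            return true;
    }

Whether silently dropping the request is the right behaviour (rather than answering the requester with an error) is exactly what a real patch would have to decide; this only shows where the NULL check would sit.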
Thanks for the answers.

I think it's not a flaw.. but design.. Tinc auto-mesh is very, very handy. You just need to avoid flat networks.

There is also IndirectMode, which forces nodes to be switched by an intermediate node... but I would be cautious about how it's used. I use it myself for certain nodes behind NATs where they cannot be connected to, so the node they always connect to handles the switching for them.

You noticed it yourself: you had a huge number of edges and you probably hit the limitation of tinc itself...

So, in zerotier all works fine? Do you still have a flat (mesh) network design? Or did you redesign the network as well?
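For reference, the "435 edges" figure is just the full-mesh formula: n nodes give n(n-1)/2 edges, and 30 × 29 / 2 = 435; and even in a hub-and-spoke layout every tinc node still learns every edge, which is why ~800 nodes and 2500+ edges were enough to overwhelm the central node here. The segmented layout described above might look roughly like the sketch below; the node names are invented for illustration, but Mode, ConnectTo, IndirectData and TunnelServer are standard tinc options:

    # tinc.conf on an ordinary node of one segment (hypothetical names)
    Name = branch23
    Mode = switch            # routing between segments handled outside tinc
    ConnectTo = segment_gw   # meta connection only to this segment's gateway

    # hosts/natbox on the gateway: a node stuck behind NAT that cannot be
    # reached directly, so traffic to it goes via the node it connects to
    IndirectData = yes

    # tinc.conf on a hub that should not forward information about
    # other nodes (the option this thread is about)
    TunnelServer = yes

Keeping each segment small keeps the per-node edge list small, which is the point being made here.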
Hi,

On 2020-07-28 5:09 a.m., borg at uu3.net wrote:
> Thanks for the answers.
>
> I think it's not a flaw.. but design.. Tinc auto-mesh is very, very handy.
> You just need to avoid flat networks.

Agreed. Tinc is a great piece of software that I have used for maybe more than 10 years. It's just that it has its limits, which could be overcome with a design change. Please note that that change would not break current features. It might slow down the initial connection a little, until the nodes learn about each other, but other than that it would still work as before.

> There is also IndirectMode, which forces nodes to be switched by an
> intermediate node... but I would be cautious about how it's used.
> I use it myself for certain nodes behind NATs where they cannot be
> connected to, so the node they always connect to handles the switching for them.

I haven't seen such a mode either in the docs or in the code itself. There is IndirectData, but according to the docs, if turned on it would actually drop packets that are forwarded, so it does the opposite of what you are describing. In any case it would not help, since the problem is not the volume of the data transferred, but the number of edges and the CPU power of the nodes.

> You noticed it yourself: you had a huge number of edges and you probably
> hit the limitation of tinc itself...
>
> So, in zerotier all works fine? Do you still have a flat (mesh)
> network design? Or did you redesign the network as well?

Yes. It is still a flat mesh network. However, it works kind of how I suggested for the design change. It has several root nodes (you can add more of your own) and it uses those to learn about the other nodes it wants to connect to. Once learned, it tries all the NAT tricks to establish a direct connection, and if that fails it forwards data through the root nodes.

There are some differences that also help, like the fact that IP assignment (and routing) is done on a controller and propagated to the nodes, so once a node is in your network you can make changes in one place instead of on the node itself. I would suggest having a look at https://www.zerotier.com/manual/ for a better understanding.

I would love tinc to continue growing; however, I know that the author is busy with other projects, so my understanding is that future development is left to the community. I personally don't have the experience in C++ and low-level networking to offer a meaningful contribution.

Best regards
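To make the proposed "ask on demand" lookup a bit more concrete, here is a tiny, self-contained toy model in C. It is not tinc code and not how ZeroTier is implemented; it only illustrates the idea that node A, knowing only its direct peers B and N, can resolve an unknown node C by asking them, and they in turn ask their own peers:

    /* on_demand_lookup.c - toy model of ARP-like node discovery.
     * All names and structures are invented for illustration. */
    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>

    #define MAX_PEERS 8

    struct node {
        const char *name;
        struct node *peers[MAX_PEERS];
        int npeers;
        bool visited;            /* loop protection for the recursive ask */
    };

    static void link_nodes(struct node *a, struct node *b) {
        a->peers[a->npeers++] = b;
        b->peers[b->npeers++] = a;
    }

    /* Does `via` know a path to `target`?  Each node only asks its own
     * direct peers, so nobody needs the full edge list up front. */
    static bool knows(struct node *via, const char *target) {
        if(via->visited)
            return false;
        via->visited = true;
        if(!strcmp(via->name, target))
            return true;
        for(int i = 0; i < via->npeers; i++)
            if(knows(via->peers[i], target))
                return true;
        return false;
    }

    /* Returns the direct peer of `asker` through which `target` is reachable. */
    static struct node *lookup(struct node *asker, const char *target) {
        asker->visited = true;
        for(int i = 0; i < asker->npeers; i++)
            if(knows(asker->peers[i], target))
                return asker->peers[i];
        return NULL;
    }

    int main(void) {
        struct node a = {.name = "A"}, b = {.name = "B"};
        struct node n = {.name = "N"}, c = {.name = "C"};
        link_nodes(&a, &b);
        link_nodes(&a, &n);
        link_nodes(&n, &c);      /* C is only reachable through N */

        struct node *hop = lookup(&a, "C");
        printf("A reaches C via %s\n", hop ? hop->name : "(nobody)");
        return 0;
    }

In a real protocol the answer would have to carry the address and key material needed to actually set up the link, and the queries would need caching, TTLs and rate limiting, but the point stands: no node would have to hold all 2500+ edges just to talk to one peer.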
Hello,

thank you for sharing your experiences!

On Mon, 27 Jul 2020 17:35:21 -0400, Anton Avramov <SRS0=7cKF=BG=lukav.com=lukav at mijnuvt.nl> wrote:
> Anyway, since I've switched to zerotier I have no problems and so far it works great.

Just in case it was not obvious to everyone: "zerotier" is published under a non-free license (due to usage restrictions): https://www.zerotier.com/pricing/

Maybe this is relevant for some readers.

Cheers,
Lars