Hi, thanks for getting back. I'll answer the questions, but I've already given up on tinc and switched to zerotier-one.

On 2020-07-27 5:10 p.m., borg at uu3.net wrote:
> Hi. I have a few questions out of curiosity.. Can't help for now with
> your problem...
>
> What version is crashing? 1.1 or 1.0?

1.1 is crashing.

> How is your network segmented..?
> I use tinc myself here a lot too (1.0) but my network is very segmented.
> I use switch mode and handle routing myself, so mesh links aren't large..
>
> I would NOT go beyond 30 nodes for full auto-mesh.. that's already like 435
> edges...

Well, it is not segmented. It used to be switch mode before (maybe a year back) and I used dnsmasq to do DHCP on the nodes.

However, I've since switched to router mode with static IPs, which reduced the traffic significantly and helped for a time.

I think the problem is that the edges get to 2500+, and then when the central node crashes and restarts, all the nodes try to reconnect. Once reconnected, every node sends those 2500+ edges back to the central node, which in turn tries to process them and forward them back over every connection already made. Since tinc is single-threaded, that processing starts to eat up the CPU, the nodes start to believe the connection is dead and reconnect again, which in turn restarts the whole process.

In my opinion this is a design flaw in tinc. The requirement that every node know about every other node limits how many nodes can be handled.

In my case the situation could maybe be mitigated with TunnelServer, but that leads to the crash, and furthermore it would prevent the other nodes from connecting to each other.

I think a better approach would be for the nodes to exchange information only when a link is to be established (something like ARP). For example, if node A wants to contact node C but has connections only to B and N, it asks them whether they know anything about C; if they don't, they in turn could ask their connections, and so on.

Anyway, since I've switched to zerotier I have no problems and so far it works great.

Best Regards

> Regards,
> Borg
>
> ---------- Original message ----------
>
> From: Anton Avramov <SRS0=TSOC=AB=lukav.com=lukav at mijnuvt.nl>
> To: tinc-devel at tinc-vpn.org
> Subject: SegFault when using TunnelServer=yes
> Date: Fri, 19 Jun 2020 12:22:36 -0400
>
> Hi all,
>
> I have a network with about ~800 nodes. The network is a mix of tinc 1.0 and
> 1.1 nodes. It has been gradually expanding for several years now.
>
> The problem is that at some point it seems the daemon cannot handle the
> processing of the new connections and the edges.
>
> There are 3 major nodes in the system and every other node initially
> makes a connection to one of them.
>
> Now, after a lot of debugging, I've limited all nodes to connect only
> to one node, and used iptables to admit new connections gradually. The last
> limit was 5 per minute.
>
> I've started to monitor how the edges grow on the main node, and I see
> that, although I've limited the connections on the other 2 major nodes,
> at some point there are rapid spikes in the edges when a new connection
> is established.
> So my guess is that the other nodes have a previous state of the edges
> which they try to push, and that is causing the main nodes to become
> overwhelmed.
>
> So I've decided to put TunnelServer=yes on the major nodes so they don't
> propagate the connections of the other nodes.
>
> However, I get a segfault soon after starting on each node where I enable
> that option.
>
> I've built from the latest code and here is a trace of such a run (this
> is not from a "major" node, but the effect is the same):
>
> Got ANS_KEY from Backbone (164.138.216.106 port 655): 16 Office Lukav_Beast 52201D7CFDC2C7E1FD7871A36E651B7AC24A52B4ED892CD953397F6BA859AB22D5D4CB235B9CF85910B6BDE91A34C85E 427 672 4 0 94.155.19.130 13935
> Using reflexive UDP address from Office: 94.155.19.130 port 13935
> UDP address of Office set to 94.155.19.130 port 13935
> Got REQ_KEY from Backbone (164.138.216.106 port 655): 15 Office Lukav_Beast
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x000055555556de41 in send_ans_key (to=to@entry=0x555555851060) at protocol_key.c:382
> 382             return send_request(to->nexthop->connection, "%d %s %s %s %d %d %d %d", ANS_KEY,
> (gdb) bt
> #0  0x000055555556de41 in send_ans_key (to=to@entry=0x555555851060) at protocol_key.c:382
> #1  0x000055555556e169 in req_key_h (c=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol_key.c:304
> #2  0x000055555556a083 in receive_request (c=c@entry=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol.c:146
> #3  0x000055555555e993 in receive_meta (c=c@entry=0x555555851be0) at meta.c:333
> #4  0x00005555555603f9 in handle_meta_connection_data (c=c@entry=0x555555851be0) at net.c:304
> #5  0x00005555555678c2 in handle_meta_io (data=0x555555851be0, flags=<optimized out>) at net_socket.c:520
> #6  0x000055555555c60a in event_loop () at event.c:359
> #7  0x00005555555607f2 in main_loop () at net.c:510
> #8  0x0000555555559208 in main (argc=6, argv=<optimized out>) at tincd.c:558
> (gdb) bt full
> #0  0x000055555556de41 in send_ans_key (to=to@entry=0x555555851060) at protocol_key.c:382
>         keylen = <optimized out>
>         key = "527E64B1DB47F2F527ADF7F609498FFCB4807AEC3CD49697D3D8D870619BC537E1B7C403875D81FC608A8F6E00D06063\000\306\377\377\377\177\000\000\331\334VUUU", '\000' <repeats 11 times>, "*\322\316\000\305\000\000\000\000\000\000\000\000\340\033\205UUU\000\000\001\000\000\000\000\000\000\000P\316\377\377\377\177\000\000\267K\205UUU\000\000`\020\205UUU\000\000@\306\377\377\377\177\000\000i\341VUUU\000\000\000\000\000\000\377\177\000\000\000\000\000\000\000\000\000\000"...
> #1  0x000055555556e169 in req_key_h (c=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol_key.c:304
>         from_name = "Office\000\061\071.130", '\000' <repeats 1003 times>...
>         to_name = "Lukav_Beast", '\000' <repeats 366 times>...
>         from = 0x555555851060
>         to = <optimized out>
>         reqno = 0
> #2  0x000055555556a083 in receive_request (c=c@entry=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol.c:146
>         reqno = <optimized out>
> #3  0x000055555555e993 in receive_meta (c=c@entry=0x555555851be0) at meta.c:333
>         result = <optimized out>
>         request = <optimized out>
>         inlen = 0
>         inbuf = "a\354\357\063J\363{\346d\177\271\371;+\212\371zFDt\271\061\370\ao\373\326\035\255=\254\257:\245\322\v\205\035\336?1\234\372\001\004\063\323\t\004-\b8\367\f\201\342\304g\332\361jL76C\340-\t\006\210\214\314,C\352)a\314\fAe\260\226\313\337\360|\256\236\263\344\205\061\207\303\t<\016\351\360\222\343[\317o\377\065<?b(\267\321\356\360\242p$\314`\325\001|\036\204'\\\205i\314W\356#N4\000q\320\300\344\071\060\236w\016\306[\323X]\237\321\347\177\313KU\367\b}\307\374\367\032c\036\332:\307\367\265o\307\212J\006NJ3!\305q\367\255\263\246\200i\035\327\001"...
>         bufp = 0x7fffffffd6f0 "a\354\357\063J\363{\346d\177\271\371;+\212\371zFDt\271\061\370\ao\373\326\035\255=\254\257:\245\322\v\205\035\336?1\234\372\001\004\063\323\t\004-\b8\367\f\201\342\304g\332\361jL76C\340-\t\006\210\214\314,C\352)a\314\fAe\260\226\313\337\360|\256\236\263\344\205\061\207\303\t<\016\351\360\222\343[\317o\377\065<?b(\267\321\356\360\242p$\314`\325\001|\036\204'\\\205i\314W\356#N4"
>         endp = <optimized out>
> #4  0x00005555555603f9 in handle_meta_connection_data (c=c@entry=0x555555851be0) at net.c:304
> No locals.
> #5  0x00005555555678c2 in handle_meta_io (data=0x555555851be0, flags=<optimized out>) at net_socket.c:520
>         c = 0x555555851be0
>         socket_error = <optimized out>
>         len = <optimized out>
> #6  0x000055555555c60a in event_loop () at event.c:359
>         node = 0x555555797dd8 <signalio+24>
>         next = 0x555555797dd8 <signalio+24>
> ---Type <return> to continue, or q <return> to quit---
>         io = 0x555555851d90
>         tv = <optimized out>
>         fds = <optimized out>
>         curgen = 7
>         diff = {tv_sec = 0, tv_usec = 512516}
>         n = <optimized out>
>         readable = {fds_bits = {256, 0 <repeats 15 times>}}
>         writable = {fds_bits = {0 <repeats 16 times>}}
> #7  0x00005555555607f2 in main_loop () at net.c:510
>         sighup = {signum = 1, cb = 0x555555560480 <sighup_handler>, data = 0x7fffffffe1a0, node = {next = 0x7fffffffe2a8, prev = 0x0, parent = 0x7fffffffe2a8, left = 0x0, right = 0x0, data = 0x7fffffffe1a0}}
>         sigterm = {signum = 15, cb = 0x55555555f900 <sigterm_handler>, data = 0x7fffffffe1f0, node = {next = 0x0, prev = 0x7fffffffe2f8, parent = 0x7fffffffe2f8, left = 0x0, right = 0x0, data = 0x7fffffffe1f0}}
>         sigquit = {signum = 3, cb = 0x55555555f900 <sigterm_handler>, data = 0x7fffffffe240, node = {next = 0x7fffffffe2f8, prev = 0x7fffffffe2a8, parent = 0x7fffffffe2f8, left = 0x7fffffffe2a8, right = 0x0, data = 0x7fffffffe240}}
>         sigint = {signum = 2, cb = 0x55555555f900 <sigterm_handler>, data = 0x7fffffffe290, node = {next = 0x7fffffffe258, prev = 0x7fffffffe1b8, parent = 0x7fffffffe258, left = 0x7fffffffe1b8, right = 0x0, data = 0x7fffffffe290}}
>         sigalrm = {signum = 14, cb = 0x5555555605b0 <sigalrm_handler>, data = 0x7fffffffe2e0, node = {next = 0x7fffffffe208, prev = 0x7fffffffe258, parent = 0x0, left = 0x7fffffffe258, right = 0x7fffffffe208, data = 0x7fffffffe2e0}}
> #8  0x0000555555559208 in main (argc=6, argv=<optimized out>) at tincd.c:558
>         umbstr = <optimized out>
>         priority = 0x0
>
> Any help is much appreciated since my network is unusable at the moment.
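Reading the backtrace, the faulting statement at protocol_key.c:382 dereferences to->nexthop->connection while "to" itself is valid, so the crash is most likely a NULL to->nexthop for the node the ANS_KEY should be routed back to; with TunnelServer=yes that node can be known by name but have no usable edge, so a next hop was never computed. As a purely illustrative sketch (not a reviewed fix; the logger call and return value follow tinc 1.1 conventions as far as I can tell), a guard could go in send_ans_key() right before the existing send_request() call:

    /* Hypothetical guard in send_ans_key(), protocol_key.c: assumes the
     * crash is a missing next hop under TunnelServer=yes.  Drop the
     * ANS_KEY instead of dereferencing a NULL pointer. */
    if(!to->nexthop || !to->nexthop->connection) {
            logger(DEBUG_PROTOCOL, LOG_ERR,
                   "No next hop known for %s, not sending ANS_KEY", to->name);
            return true;
    }

Whether silently dropping the request is the right behaviour (rather than answering the requester with an error) is exactly what a real patch would have to decide; this only shows where the NULL check would sit.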
Thanks for the answers.

I think it's not a flaw.. but design.. Tinc auto-mesh is very, very handy. You just need to avoid flat networks.

There is also IndirectMode, which forces nodes to be switched by an intermediate node... but I would be cautious about how it's used. I use it myself for certain nodes behind NATs where they cannot be connected to, so the node they always connect to handles the switching for them.

You noticed it yourself: you had a huge number of edges and you probably hit the limitation of tinc itself...

So, in zerotier all works fine? Do you still have a flat (mesh) network design? Or did you redesign the network as well?
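For reference, the "435 edges" figure is just the full-mesh formula: n nodes give n(n-1)/2 edges, and 30 × 29 / 2 = 435; and even in a hub-and-spoke layout every tinc node still learns every edge, which is why ~800 nodes and 2500+ edges were enough to overwhelm the central node here. The segmented layout described above might look roughly like the sketch below; the node names are invented for illustration, but Mode, ConnectTo, IndirectData and TunnelServer are standard tinc options:

    # tinc.conf on an ordinary node of one segment (hypothetical names)
    Name = branch23
    Mode = switch            # routing between segments handled outside tinc
    ConnectTo = segment_gw   # meta connection only to this segment's gateway

    # hosts/natbox on the gateway: a node stuck behind NAT that cannot be
    # reached directly, so traffic to it goes via the node it connects to
    IndirectData = yes

    # tinc.conf on a hub that should not forward information about
    # other nodes (the option this thread is about)
    TunnelServer = yes

Keeping each segment small keeps the per-node edge list small, which is the point being made here.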
Hi,

On 2020-07-28 5:09 a.m., borg at uu3.net wrote:
> Thanks for the answers.
>
> I think it's not a flaw.. but design.. Tinc auto-mesh is very, very handy.
> You just need to avoid flat networks.

Agreed. Tinc is a great piece of software that I have used for maybe more than 10 years. It's just that it has its limits, which could be overcome with a design change. Please note that that change would not break current features. It might slow down the initial connection a little, until the nodes learn about each other, but other than that it would still work as before.

> There is also IndirectMode, which forces nodes to be switched by an
> intermediate node... but I would be cautious about how it's used.
> I use it myself for certain nodes behind NATs where they cannot be
> connected to, so the node they always connect to handles the switching for them.

I haven't seen such a mode either in the docs or in the code itself. There is IndirectData, but according to the docs, if turned on it would actually drop packets that are forwarded, so it does the opposite of what you are describing. In any case it would not help, since the problem is not the volume of the data transferred, but the number of edges and the CPU power of the nodes.

> You noticed it yourself: you had a huge number of edges and you probably
> hit the limitation of tinc itself...
>
> So, in zerotier all works fine? Do you still have a flat (mesh)
> network design? Or did you redesign the network as well?

Yes. It is still a flat mesh network. However, it works kind of how I suggested for the design change. It has several root nodes (you can add more of your own) and it uses those to learn about the other nodes it wants to connect to. Once learned, it tries all the NAT tricks to establish a direct connection, and if that fails it forwards data through the root nodes.

There are some differences that also help, like the fact that IP assignment (and routing) is done on a controller and propagated to the nodes, so once a node is in your network you can make changes in one place instead of on the node itself. I would suggest having a look at https://www.zerotier.com/manual/ for a better understanding.

I would love tinc to continue growing; however, I know that the author is busy with other projects, so my understanding is that future development is left to the community. I personally don't have the experience in C++ and low-level networking to offer a meaningful contribution.

Best regards
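To make the proposed "ask on demand" lookup a bit more concrete, here is a tiny, self-contained toy model in C. It is not tinc code and not how ZeroTier is implemented; it only illustrates the idea that node A, knowing only its direct peers B and N, can resolve an unknown node C by asking them, and they in turn ask their own peers:

    /* on_demand_lookup.c - toy model of ARP-like node discovery.
     * All names and structures are invented for illustration. */
    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>

    #define MAX_PEERS 8

    struct node {
        const char *name;
        struct node *peers[MAX_PEERS];
        int npeers;
        bool visited;            /* loop protection for the recursive ask */
    };

    static void link_nodes(struct node *a, struct node *b) {
        a->peers[a->npeers++] = b;
        b->peers[b->npeers++] = a;
    }

    /* Does `via` know a path to `target`?  Each node only asks its own
     * direct peers, so nobody needs the full edge list up front. */
    static bool knows(struct node *via, const char *target) {
        if(via->visited)
            return false;
        via->visited = true;
        if(!strcmp(via->name, target))
            return true;
        for(int i = 0; i < via->npeers; i++)
            if(knows(via->peers[i], target))
                return true;
        return false;
    }

    /* Returns the direct peer of `asker` through which `target` is reachable. */
    static struct node *lookup(struct node *asker, const char *target) {
        asker->visited = true;
        for(int i = 0; i < asker->npeers; i++)
            if(knows(asker->peers[i], target))
                return asker->peers[i];
        return NULL;
    }

    int main(void) {
        struct node a = {.name = "A"}, b = {.name = "B"};
        struct node n = {.name = "N"}, c = {.name = "C"};
        link_nodes(&a, &b);
        link_nodes(&a, &n);
        link_nodes(&n, &c);      /* C is only reachable through N */

        struct node *hop = lookup(&a, "C");
        printf("A reaches C via %s\n", hop ? hop->name : "(nobody)");
        return 0;
    }

In a real protocol the answer would have to carry the address and key material needed to actually set up the link, and the queries would need caching, TTLs and rate limiting, but the point stands: no node would have to hold all 2500+ edges just to talk to one peer.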
Hello,

thank you for sharing your experiences!

On Mon, 27 Jul 2020 17:35:21 -0400, Anton Avramov <SRS0=7cKF=BG=lukav.com=lukav at mijnuvt.nl> wrote:
> Anyway, since I've switched to zerotier I have no problems and so far it works great.

Just in case it was not obvious to everyone: "zerotier" is published under a non-free license (due to usage restrictions): https://www.zerotier.com/pricing/

Maybe this is relevant for some readers.

Cheers,
Lars