thr3ads.net - tinc devel - tinc 1.1: missing PONG [Feb 2018]

If this information is useful, please help other people find it:
Share via:

Todd C. Miller

2018-Feb-23 22:04 UTC

tinc 1.1: missing PONG

I've noticed that on Windows during large transfers (such as an
iperf run), the receiving end sometimes fails to respond to a PING.
When this happens, the sender closes the connection and both ends
have to renegotiate.

My theory is that, due to the properties of the splay tree, the TAP
device event is almost always at the head of the tree and thus if
there is traffic, that event will be serviced at the expense of
meta communication.

Because splay_search() is used to find the first event returned by
WSAWaitForMultipleEvents(), the next time through the loop the last
event serviced will be the first in the events[] array.  This doesn't
happen with the select() version of the event loop.

It is possible to eliminate the splay_search() by using an extra
array to hold io_t pointers but I've had better results just filling
the events[] array in reverse order.  This results in the most
recently serviced event being added to the end of events[], which
should prevent one very busy event from monopolizing the event loop.

With this change I can no longer reproduce the problem of nodes not
responding to pings.

Another solution would to randomly shuffle events[] but that seems
needlessly complicated.

 - todd

diff --git a/src/event.c b/src/event.c
index 331872a..03111d0 100644
--- a/src/event.c
+++ b/src/event.c
@@ -403,11 +403,14 @@ bool event_loop(void) {
 		}
 
 		WSAEVENT *events = xmalloc(event_count * sizeof(*events));
-		DWORD event_index = 0;
+		DWORD event_index = event_count;
 
+		/*
+		 * Fill events[] in reverse order.  Otherwise we may starve
+		 * events other than the TAP, which is usually at the head.
+		 */
 		for splay_each(io_t, io, &io_tree) {
-			events[event_index] = io->event;
-			event_index++;
+			events[--event_index] = io->event;
 		}
 
 		DWORD result = WSAWaitForMultipleEvents(event_count, events, FALSE,
timeout_ms, FALSE);

Guus Sliepen

2018-Feb-26 22:01 UTC

head link

tinc 1.1: missing PONG

On Fri, Feb 23, 2018 at 03:04:24PM -0700, Todd C. Miller wrote:
> I've noticed that on Windows during large transfers (such as an
> iperf run), the receiving end sometimes fails to respond to a PING.
> When this happens, the sender closes the connection and both ends
> have to renegotiate.
> 
> My theory is that, due to the properties of the splay tree, the TAP
> device event is almost always at the head of the tree and thus if
> there is traffic, that event will be serviced at the expense of
> meta communication.
> 
> Because splay_search() is used to find the first event returned by
> WSAWaitForMultipleEvents(), the next time through the loop the last
> event serviced will be the first in the events[] array.  This doesn't
> happen with the select() version of the event loop.
> 
> It is possible to eliminate the splay_search() by using an extra
> array to hold io_t pointers but I've had better results just filling
> the events[] array in reverse order.  This results in the most
> recently serviced event being added to the end of events[], which
> should prevent one very busy event from monopolizing the event loop.
The problem is not the order of the events, the problem is that in the
Windows version of the event loop, we only handle one event in each loop
iteration. The select() loop handles all events that have accumulated so
far, so regardless of the order it handles them, it never starves fd. At
least, that was what I thought, until I double checked and found out
that we actually don't in tinc 1.1 (tinc 1.0 is correct though).

So, we have to find a proper fix for both the POSIX and Windows variants
of the event loop in 1.1. For POSIX, the issue is that we might
invalidate the loop iterator due to a callback. It should be easy to
detect that and only exit the loop early if that's really the case. I'll
do that.

For Windows, there are two issues that have to be fixed. First is
that we only handle a single writable IO event. The second is that
WSAWaitForMultipleEvents() only returns one index into the array of
events, even though multiple events might have triggered. How hard would
it be to check for all triggered events?
> With this change I can no longer reproduce the problem of nodes not
> responding to pings.
This will merely move the problem to some other filedescriptor.
> Another solution would to randomly shuffle events[] but that seems
> needlessly complicated.
Yes, that's a bad solution :)

-- 
Met vriendelijke groet / with kind regards,
     Guus Sliepen <guus at tinc-vpn.org>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL:
<http://www.tinc-vpn.org/pipermail/tinc-devel/attachments/20180226/6b6bc3d8/attachment.sig>

Todd C. Miller

2018-Feb-26 22:18 UTC

head link

tinc 1.1: missing PONG

On Mon, 26 Feb 2018 23:01:29 +0100, Guus Sliepen wrote:
> The problem is not the order of the events, the problem is that in the
> Windows version of the event loop, we only handle one event in each loop
> iteration. The select() loop handles all events that have accumulated so
> far, so regardless of the order it handles them, it never starves fd. At
> least, that was what I thought, until I double checked and found out
> that we actually don't in tinc 1.1 (tinc 1.0 is correct though).
Sure, but changing the order of the events changes which one will
be in that first slot.
> So, we have to find a proper fix for both the POSIX and Windows variants
> of the event loop in 1.1. For POSIX, the issue is that we might
> invalidate the loop iterator due to a callback. It should be easy to
> detect that and only exit the loop early if that's really the case.
I'll
> do that.
How would you detect that?  By checking whether the next node has
changed?
> For Windows, there are two issues that have to be fixed. First is
> that we only handle a single writable IO event. The second is that
> WSAWaitForMultipleEvents() only returns one index into the array of
> events, even though multiple events might have triggered. How hard would
> it be to check for all triggered events?
I don't think it is too hard.  I was working in that direction
initially but wasn't sure if a larger change would be welcome.
I'll take a stab at it.
> > With this change I can no longer reproduce the problem of nodes not
> > responding to pings.
>
> This will merely move the problem to some other filedescriptor.
Changing the order of the events changes the index of which event
is returned by WSAWaitForMultipleEvents().  Moving the most recently
accessed event to the end gives the others a chance to proceed.

 - todd

Seemingly Similar Threads

Search for more apparently analagous threads

tinc devel - Feb 2018 - tinc 1.1: missing PONG

tinc 1.1: missing PONG

tinc 1.1: missing PONG

tinc 1.1: missing PONG

Seemingly Similar Threads