We have a Zenoss server in our main office monitoring (among many other
things) an Apache server in a remote network, with a Tinc link between
the two networks. The monitoring simply involves making an HTTP request
to a URL once every 5 minutes and confirming that a response page comes
back.
Most of the requests to this particular web server succeed (and similar
requests to other web servers don't have any problems), but several
times a day one of the requests will fail, causing Zenoss to generate an
alert... only to have the next request succeed and the alert clear.
Looking closely through various logs and running tcpdump on the two Tinc
servers, I discovered that Tinc seems to be munging the mss value on the
packets that the remote server is sending back, and in the process
(sometimes) generating an incorrect packet cksum. That causes the
kernel on the local Tinc server to decide the packet is invalid (which
in turn causes it to get dropped by the iptables firewall rules there).
For example, here's the tcpdump output for a return packet as seen going
into the tun interface on the remote tinc server:
2015-09-09 11:42:13.076518 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto TCP (6), length 60)
{webserver}.http > {zenoss-server}.42319: Flags [S.], cksum 0xe771
(correct), seq 1243683600, ack 2163080236, win 26847, options [mss
8961,sackOK,TS val 140685120 ecr 136010701,nop,wscale 7], length 0
0x0000: 4500 003c 0000 4000 3f06 04e1 0a50 0070 E..<..@.?....P.p
0x0010: ac12 8009 0050 a54f 4a21 1b10 80ed fc2c .....P.OJ!.....,
0x0020: a012 68df e771 0000 0204 2301 0402 080a ..h..q....#.....
0x0030: 0862 af40 081b 5bcd 0103 0307 .b.@..[.....
While here is that same packet coming out of the tun interface on the
local Tinc server:
2015-09-09 11:42:15.094332 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto TCP (6), length 60)
{webserver}.http > {zenoss-server}.42319: Flags [S.], cksum 0x0301
(incorrect -> 0x0302), seq 1243683600, ack 2163080236, win 26847, options
[mss 1405,sackOK,TS val 140685620 ecr 136010701,nop,wscale 7], length 0
0x0000: 4500 003c 0000 4000 3f06 04e1 0a50 0070 E..<..@.?....P.p
0x0010: ac12 8009 0050 a54f 4a21 1b10 80ed fc2c .....P.OJ!.....,
0x0020: a012 68df 0301 0000 0204 057d 0402 080a ..h........}....
0x0030: 0862 b134 081b 5bcd 0103 0307 .b.4..[.....
That packet then causes the following kern.log message on that Tinc server:
2015-09-09T11:42:13.097949-04:00 {tinc-server} kern.notice kernel:
[22743995.678025] nf_ct_tcp: bad TCP checksum IN= OUT= SRC=10.80.0.112
DST=172.18.128.9 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=0 DF PROTO=TCP SPT=80
DPT=42319 SEQ=1243683600 ACK=2163080236 WINDOW=26847 RES=0x00 ACK SYN URGP=0 OPT
(0204057D0402080A0862AF40081B5BCD01030307)
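
For anyone who wants to double-check tcpdump's "correct"/"incorrect"
verdicts independently, here is a minimal Python sketch that recomputes
the TCP checksum straight from the two hex dumps above. The function
names (tcp_checksum, stored_checksum) and the constants are my own, and
it assumes plain IPv4 carrying TCP with the whole packet captured.

```python
import struct

def tcp_checksum(ip_packet: bytes) -> int:
    """Return the TCP checksum this IPv4 packet should carry (RFC 1071 sum)."""
    ihl = (ip_packet[0] & 0x0F) * 4            # IPv4 header length in bytes
    segment = bytearray(ip_packet[ihl:])
    segment[16:18] = b"\x00\x00"               # checksum field counts as zero
    # Pseudo-header: source addr, destination addr, zero, protocol 6, TCP length.
    pseudo = ip_packet[12:20] + struct.pack("!BBH", 0, 6, len(segment))
    data = pseudo + bytes(segment)
    if len(data) % 2:
        data += b"\x00"                        # pad to a whole 16-bit word
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                         # end-around carry
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def stored_checksum(ip_packet: bytes) -> int:
    """Return the checksum value actually present in the TCP header."""
    ihl = (ip_packet[0] & 0x0F) * 4
    return int.from_bytes(ip_packet[ihl + 16:ihl + 18], "big")

# Packet going into the tun interface on the remote Tinc server (mss 8961).
REMOTE_IN = bytes.fromhex(
    "4500 003c 0000 4000 3f06 04e1 0a50 0070"
    "ac12 8009 0050 a54f 4a21 1b10 80ed fc2c"
    "a012 68df e771 0000 0204 2301 0402 080a"
    "0862 af40 081b 5bcd 0103 0307")

# Packet coming out of the tun interface on the local Tinc server (mss 1405).
LOCAL_OUT = bytes.fromhex(
    "4500 003c 0000 4000 3f06 04e1 0a50 0070"
    "ac12 8009 0050 a54f 4a21 1b10 80ed fc2c"
    "a012 68df 0301 0000 0204 057d 0402 080a"
    "0862 b134 081b 5bcd 0103 0307")

for name, pkt in (("remote in", REMOTE_IN), ("local out", LOCAL_OUT)):
    print("%-9s stored=0x%04x expected=0x%04x"
          % (name, stored_checksum(pkt), tcp_checksum(pkt)))
```

This agrees with tcpdump: the remote-side packet carries the checksum
it should (0xe771), while the local-side packet carries 0x0301 where
0x0302 is expected.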
In cases where Zenoss ends up complaining, it appears that all the
SYN-ACK ("Flags [S.]") reply packets from the webserver end up with an
incorrect checksum, while in most cases Tinc at least eventually
generates a valid reply packet and the TCP session proceeds
successfully. (For example, the packet above was followed by another
incorrect packet, then one that succeeded:
2015-09-09 11:42:15.098603 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto TCP (6), length 60)
{webserver}.http > {zenoss-server}.42319: Flags [S.], cksum 0x0300
(incorrect -> 0x0301), seq 1243683600, ack 2163080236, win 26847, options
[mss 1405,sackOK,TS val 140685621 ecr 136010701,nop,wscale 7], length 0
0x0000: 4500 003c 0000 4000 3f06 04e1 0a50 0070 E..<..@.?....P.p
0x0010: ac12 8009 0050 a54f 4a21 1b10 80ed fc2c .....P.OJ!.....,
0x0020: a012 68df 0300 0000 0204 057d 0402 080a ..h........}....
0x0030: 0862 b135 081b 5bcd 0103 0307 .b.5..[.....
[...]
2015-09-09 11:42:19.106462 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF],
proto TCP (6), length 60)
{webserver}.http > {zenoss-server}.42319: Flags [S.], cksum 0xff16
(correct), seq 1243683600, ack 2163080236, win 26847, options [mss
1405,sackOK,TS val 140686623 ecr 136010701,nop,wscale 7], length 0
0x0000: 4500 003c 0000 4000 3f06 04e1 0a50 0070 E..<..@.?....P.p
0x0010: ac12 8009 0050 a54f 4a21 1b10 80ed fc2c .....P.OJ!.....,
0x0020: a012 68df ff16 0000 0204 057d 0402 080a ..h........}....
0x0030: 0862 b51f 081b 5bcd 0103 0307 .b....[.....
)
In all cases (correct and incorrect cksum), the reply packet goes into
the remote Tinc instance with an mss of 8961 and comes out of the
local instance with that changed to 1405.
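
The mss change is also visible directly in the raw option bytes
("0204 2301" in the remote dump, "0204057D" in the kern.log OPT string).
As a rough sketch, assuming the standard kind/length/value layout of TCP
options, the following Python decodes both; parse_tcp_options and mss_of
are names I made up for this example.

```python
def parse_tcp_options(options: bytes):
    """Yield (kind, payload) pairs from a raw TCP options field."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                      # End of Option List
            break
        if kind == 1:                      # NOP is a single byte, no length
            yield kind, b""
            i += 1
            continue
        if i + 1 >= len(options):
            raise ValueError("truncated TCP option")
        length = options[i + 1]            # length covers kind and length bytes
        if length < 2 or i + length > len(options):
            raise ValueError("malformed TCP option")
        yield kind, options[i + 2:i + length]
        i += length

def mss_of(options: bytes):
    """Return the value of the MSS option (kind 2), or None if absent."""
    for kind, payload in parse_tcp_options(options):
        if kind == 2 and len(payload) == 2:
            return int.from_bytes(payload, "big")
    return None

# Option bytes from the remote-side hex dump and from the kern.log OPT field.
REMOTE_OPTS = bytes.fromhex("020423010402080a0862af40081b5bcd01030307")
LOCAL_OPTS = bytes.fromhex("0204057D0402080A0862AF40081B5BCD01030307")

print("remote mss:", mss_of(REMOTE_OPTS))
print("local mss: ", mss_of(LOCAL_OPTS))
```

Apart from the two mss bytes, the option fields in those two strings are
identical, timestamp included.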
It seems that when the cksum is incorrect, it is always off by 1.
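
To illustrate what an off-by-1 can look like, here is a hedged Python
sketch of the standard incremental checksum update (RFC 1624,
HC' = ~(~HC + ~m + m')) next to a deliberately flawed variant that uses
plain 16-bit wraparound arithmetic without the end-around carry. The
flawed variant is purely hypothetical and is NOT taken from the Tinc
source; I have not checked what Tinc 1.0.16 actually does. The sketch
also assumes that the only fields changed in transit are the mss value
and the checksum, so the remote-side packet is reconstructed by putting
mss 8961 back into the local-side dump.

```python
import struct

def tcp_checksum(ip_packet: bytes) -> int:
    """Full recomputation of the TCP checksum for an IPv4 packet."""
    ihl = (ip_packet[0] & 0x0F) * 4
    segment = bytearray(ip_packet[ihl:])
    segment[16:18] = b"\x00\x00"               # ignore the stored checksum
    data = ip_packet[12:20] + struct.pack("!BBH", 0, 6, len(segment)) + segment
    if len(data) % 2:
        data += b"\x00"                        # pad to a whole 16-bit word
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def rfc1624_update(old_csum: int, old_word: int, new_word: int) -> int:
    """Incremental update with end-around carry: HC' = ~(~HC + ~m + m')."""
    total = (~old_csum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def wrapping_update(old_csum: int, old_word: int, new_word: int) -> int:
    """HYPOTHETICAL flawed update: two's-complement wraparound, carry lost."""
    total = ((~old_csum & 0xFFFF) - old_word + new_word) & 0xFFFF
    return ~total & 0xFFFF

def with_mss(ip_packet: bytes, mss: int) -> bytes:
    """Copy of the packet with a different MSS value (checksum untouched)."""
    pkt = bytearray(ip_packet)
    assert pkt[40:42] == b"\x02\x04"           # MSS is the first option here
    pkt[42:44] = mss.to_bytes(2, "big")
    return bytes(pkt)

HEAD = ("4500 003c 0000 4000 3f06 04e1 0a50 0070"
        "ac12 8009 0050 a54f 4a21 1b10 80ed fc2c")
# Local-side dumps from 11:42:15.094332 (bad cksum) and 11:42:19.106462 (good).
BAD = bytes.fromhex(HEAD + "a012 68df 0301 0000 0204 057d 0402 080a"
                           "0862 b134 081b 5bcd 0103 0307")
GOOD = bytes.fromhex(HEAD + "a012 68df ff16 0000 0204 057d 0402 080a"
                            "0862 b51f 081b 5bcd 0103 0307")

for name, pkt in (("bad", BAD), ("good", GOOD)):
    before = tcp_checksum(with_mss(pkt, 8961))  # assumed remote-side checksum
    print("%-4s rfc1624=0x%04x wrapping=0x%04x"
          % (name, rfc1624_update(before, 8961, 1405),
             wrapping_update(before, 8961, 1405)))
```

Under those assumptions the RFC 1624 update gives the values tcpdump
expects (0x0302 and 0xff16), while the hypothetical wrapping variant
gives 0x0301 for the first packet and 0xff16 for the second, which
happens to match the values seen on the wire. That is only suggestive,
not proof of where the problem lies.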
(The related packets generated on the Zenoss side start out with an mss
of 1460, so presumably Tinc doesn't edit the packets going in that
direction; I have not found the "incorrect ->" message in the tcpdump
output on the remote Tinc server.)
Currently we are running Ubuntu Precise for both Tinc servers, so we
have tinc v1.0.16 installed.
Am I correct in concluding that this cksum problem is a bug in Tinc?
If so, is it a known bug that has been corrected in some later Tinc
release?
(I looked through release announcements, the git commit log, and list
archives but didn't immediately see anything that appeared to be
related....)
Thanks.
Nathan
----------------------------------------------------------------------------
Nathan Stratton Treadway - nathanst at ontko.com - Mid-Atlantic region
Ray Ontko & Co. - Software consulting services - http://www.ontko.com/
GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt ID: 1023D/ECFB6239
Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239