On 06.10.2011 15:49, Michael Tokarev wrote:> Hello.
>
> I'm trying to diagnose a problem in our network which - apparently -
> started after I enabled DNSSEC on our local zones (island of security),
> but may be unrelated as well.
>
> The problem is that there's frequent messages in the logs on the
mahcine
> running nsd:
>
> Oct 6 14:32:32 mother nsd[1363]: failed writing to tcp: Broken pipe
Here's one strace of the issue:
1363 pselect6(9, [3 4 8], [], [], NULL, {NULL, 8}) = 1 (in [4])
1363 accept(4, {sa_family=AF_INET, sin_port=htons(58332),
sin_addr=inet_addr("192.168.177.5")}, [16]) = 7
1363 fcntl64(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
1363 brk(0xf8fd9000) = 0xf8fd9000
1363 gettimeofday({1317902272, 86998}, NULL) = 0
1363 gettimeofday({1317902272, 87060}, NULL) = 0
1363 pselect6(9, [3 4 7 8], [], [], {119, 999938000}, {NULL, 8}) = 1 (in [7],
left {119, 999934221})
1363 read(7, "\0\34", 2) = 2
1363 read(7, "\32\333\0\0\0\1\0\0\0\0\0\0\3tls\3msk\2ru\0\0\374\0\1",
28) = 28
As far as I see, this is a plain AXFR IN request for tls.msk.ru domain,
made by 192.168.177.5, which runs both unbound and nsd. I can guess it
was nsd asking for AXFR.
1363 gettimeofday({1317902272, 87498}, NULL) = 0
1363 gettimeofday({1317902272, 87557}, NULL) = 0
1363 pselect6(9, [3 4 8], [7], [], {119, 999941000}, {NULL, 8}) = 1 (out [7],
left {119, 999938160})
1363 write(7, "?\325", 2) = 2
1363 write(7,
"\32\333\200\0\0\1\0\242\0\0\0\0\3tls\3msk\2ru\0\0\374\0\1\300\f\0\6"...,
16341) = 11584
1363 gettimeofday({1317902272, 88002}, NULL) = 0
1363 pselect6(9, [3 4 8], [7], [], {119, 999496000}, {NULL, 8}) = 1 (out [7],
left {119, 999493205})
1363 write(7,
"\237\v\270U\235$\350\302\376\16\274\201\5\204\336\237\230;\16\330#\237\367\327\3330\203\310DM\33H"...,
4757) = 4757
1363 gettimeofday({1317902272, 88362}, NULL) = 0
1363 gettimeofday({1317902272, 88424}, NULL) = 0
1363 pselect6(9, [3 4 8], [7], [], {119, 999938000}, {NULL, 8}) = 1 (out [7],
left {119, 999930669})
1363 write(7, "?\372", 2) = 2
1363 write(7,
"\32\333\200\0\0\0\0\241\0\0\0\0\5f0009\3tls\3msk\2ru\0\0."..., 16378)
= 16378
1363 gettimeofday({1317902272, 88932}, NULL) = 0
1363 gettimeofday({1317902272, 89000}, NULL) = 0
1363 pselect6(9, [3 4 8], [7], [], {119, 999932000}, {NULL, 8}) = 1 (out [7],
left {119, 999145123})
1363 write(7, "?r", 2) = -1 EPIPE (Broken pipe)
1363 --- SIGPIPE (Broken pipe) @ 0 (0) ---
1363 time(NULL) = 1317902272
1363 send(5, "<27>Oct 6 15:57:52 nsd[1363]: f"..., 65,
MSG_NOSIGNAL) = 65
But after sending 2 + 11584 + 4757 + 2 + 16378 bytes, the remote
gave up and closed the connection. Again, I can only guess it
was nsd who decided that it does not want the rest of data, most
likely due to the fact that it have that SOA already.
The SOA record for this zone is:
tls.msk.ru. SOA mother.tls.msk.ru. mjt.tls.msk.ru. 2011100501 21600 600
1209600 86400
which gives 6h refresh and 10m retry interval, and the latter matches
with the logged pattern:
> Oct 6 14:41:58 mother nsd[1363]: failed writing to tcp: Broken pipe
> Oct 6 14:41:58 mother nsd[1363]: failed writing to tcp: Broken pipe
> Oct 6 14:51:02 mother nsd[1363]: failed writing to tcp: Broken pipe
> Oct 6 15:00:07 mother nsd[1363]: failed writing to tcp: Broken pipe
> Oct 6 15:00:07 mother nsd[1363]: failed writing to tcp: Broken pipe
> Oct 6 15:09:37 mother nsd[1363]: failed writing to tcp: Broken pipe
> Oct 6 15:09:37 mother last message repeated 2 times
> Oct 6 15:19:35 mother nsd[1363]: failed writing to tcp: Broken pipe
> Oct 6 15:19:35 mother nsd[1363]: failed writing to tcp: Broken pipe
14:41, 14:51, 15:00, 15:09, 15:19, ... - about every 10 minutes.
> There are just handful other machines accessing this service (it is a
> hidden primary), either other instances of nsd or unbound servers.
> As far as I can tell, the "bad" connections logged by nsd comes
from
> unbound servers - running tcpdump right now.
There are 3 other machines accessing this server "by" nsd (and one of
them is 192.168.177.5).
>
> But the thing is: the above error message is mostly useless, it is
> just a "random noize" which - IMHO anyway - should either tell
more
> (at least, where the connection comes from) or should be dropped
> completely.
>
> In the code, there's a check for ECONNRESET - should it be expanded
> to EPIPE as well?
Note that in the strace there are other cases like this, but with ECONNRESET
error instead of EPIPE:
1363 accept(4, {sa_family=AF_INET, sin_port=htons(43260),
sin_addr=inet_addr("192.168.19.2")}, [16]) = 7
1363 fcntl64(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
1363 brk(0xf8fd9000) = 0xf8fd9000
1363 gettimeofday({1317902225, 62655}, NULL) = 0
1363 gettimeofday({1317902225, 62736}, NULL) = 0
1363 pselect6(9, [3 4 7 8], [], [], {119, 999919000}, {NULL, 8}) = 1 (in [7],
left {119, 999910531})
1363 read(7, "\0\34", 2) = 2
1363 read(7, 0xf8f98118, 28) = -1 EAGAIN (Resource temporarily
unavailable)
1363 gettimeofday({1317902225, 63112}, NULL) = 0
1363 pselect6(9, [3 4 7 8], [], [], {119, 999543000}, {NULL, 8}) = 1 (in [7],
left {119, 982674025})
1363 read(7, "%\272\0\0\0\1\0\0\0\0\0\0\3tls\3msk\2ru\0\0\374\0\1",
28) = 28
1363 gettimeofday({1317902225, 80761}, NULL) = 0
1363 gettimeofday({1317902225, 80823}, NULL) = 0
1363 pselect6(9, [3 4 8], [7], [], {119, 999938000}, {NULL, 8}) = 1 (out [7],
left {119, 999934106})
1363 write(7, "?\325", 2) = 2
1363 write(7,
"%\272\200\0\0\1\0\242\0\0\0\0\3tls\3msk\2ru\0\0\374\0\1\300\f\0\6"...,
16341) = 10792
1363 gettimeofday({1317902225, 81210}, NULL) = 0
1363 pselect6(9, [3 4 8], [7], [], {119, 999551000}, {NULL, 8}) = 1 (out [7],
left {119, 983838077})
1363 write(7,
"\310\211\17\251\355\17\366\257\275Tm\27\350\312\0/\0\1\0\1Q\200\0\35\3NET\4deo"...,
5549) = 5549
1363 gettimeofday({1317902225, 97713}, NULL) = 0
1363 gettimeofday({1317902225, 97776}, NULL) = 0
1363 pselect6(9, [3 4 8], [7], [], {119, 999937000}, {NULL, 8}) = 1 (out [7],
left {119, 999933714})
1363 write(7, "?\372", 2) = 2
1363 write(7,
"%\272\200\0\0\0\0\241\0\0\0\0\5f0009\3tls\3msk\2ru\0\0."..., 16378) =
11986
1363 gettimeofday({1317902225, 98069}, NULL) = 0
1363 pselect6(9, [3 4 8], [7], [], {119, 999644000}, {NULL, 8}) = 1 (out [7],
left {119, 962512462})
1363 write(7,
"\241\320[*\272\201\20\247\331\351\277\334\221\34=\200\313\226\330\203\215\236R7\333\312\f\340{\376R\360"...,
4392) = 4392
1363 gettimeofday({1317902225, 136042}, NULL) = 0
1363 gettimeofday({1317902225, 136101}, NULL) = 0
1363 pselect6(9, [3 4 8], [7], [], {119, 999941000}, {NULL, 8}) = 1 (out [7],
left {119, 999936949})
1363 write(7, "?r", 2) = 2
1363 write(7,
"%\272\200\0\0\0\0\242\0\0\0\0\5f0055\3tls\3msk\2ru\0\0."..., 16242) =
16242
1363 gettimeofday({1317902225, 136481}, NULL) = 0
1363 gettimeofday({1317902225, 136537}, NULL) = 0
1363 pselect6(9, [3 4 8], [7], [], {119, 999944000}, {NULL, 8}) = 1 (out [7],
left {119, 972729681})
1363 write(7, "?\366", 2) = -1 ECONNRESET (Connection
reset by peer)
So the two at least _may_ be treated the same.
But should the "client" NSD be doing like that? Maybe it is better
to query for SOA first, instead of asking AXFR right away, and just
closing the connection when already up to date?
> And, where - in server.c:handle_tcp_writing() - one can get the peer
> address for handler->fd, short of calling getpeername()? I mean, is
> the peer address available somewhere in the connection structures in
> a ready to use form?
Thanks,
/mjt