thr3ads.net - freebsd stable - ntp problems stratum 2 to 14? [Mar 2020]

If this information is useful, please help other people find it:
Share via:

Ian Lepore

2020-Mar-05 18:19 UTC

ntp problems stratum 2 to 14?

On Wed, 2020-02-26 at 16:37 +1100, Dewayne Geraghty
wrote:> I usually run ntpd with both aslr and as user ntpd.  While testing I
> noticed that my server with a direct network cable to my main time keeper,
> jumped from the expected stratum 2 to 14 as follows (I record the date so I
> can synch with the debug log, also below):
> 
> vm.loadavg={ 0.09 0.10 0.18 }
> 
> Wed 26 Feb 2020 15:16:38 AEDT
>      remote           refid      st t when poll reach   delay   offset
>  jitter
>
=============================================================================>
10.0.7.6        203.35.83.242    2 u   44   64  377    0.147  -227.12
>  33.560
> *127.127.1.1     .LOCL.          14 l   59  128  377    0.000    0.000
> 0.000
> Wed 26 Feb 2020 15:18:46 AEDT
>      remote           refid      st t when poll reach   delay   offset
>  jitter
>
=============================================================================>
10.0.7.6        LOCAL(1)        14 u   42   64  377    0.147  -227.12
>  44.529
> *127.127.1.1     .LOCL.          14 l   59  128  377    0.000    0.000
> 0.000
> Wed 26 Feb 2020 15:20:54 AEDT
>      remote           refid      st t when poll reach   delay   offset
>  jitter
>
=============================================================================>
10.0.7.6        LOCAL(1)        14 u   42   64  377    0.147  -227.12
>  73.969
> *127.127.1.1     .LOCL.          14 l   59  128  377    0.000    0.000
> 0.000
> Wed 26 Feb 2020 15:23:02 AEDT
>      remote           refid      st t when poll reach   delay   offset
>  jitter
>
=============================================================================>
*10.0.7.6        LOCAL(1)        14 u   37   64  377    0.164  -370.64
>  74.119
>  127.127.1.1     .LOCL.          14 l   59  128  377    0.000    0.000
> 0.000
> Time marches on
> Wed 26 Feb 2020 16:03:35 AEDT
>      remote           refid      st t when poll reach   delay   offset
>  jitter
>
=============================================================================>
*10.0.7.6        LOCAL(1)        14 u   11   64  177    0.133   -3.148
>  72.295
>  127.127.1.1     .LOCL.          14 l  406  128   10    0.000    0.000
> 0.000
> Wed 26 Feb 2020 16:05:43 AEDT
>      remote           refid      st t when poll reach   delay   offset
>  jitter
>
=============================================================================>
*10.0.7.6        203.35.83.242    2 u    7   64  377    0.164  -42.789
>  73.762
>  127.127.1.1     .LOCL.          14 l  534  128   20    0.000    0.000
> 0.000
> 
> The debug for the above is:
> 26 Feb 14:58:33 ntpd[8772]: Command line: /usr/local/sbin/ntpd -c
> /etc/ntp.conf -g -g -u ntpd --nofork
> ...
> 26 Feb 14:58:34 ntpd[8772]: 10.0.7.6 e014 84 reachable
> 26 Feb 14:58:35 ntpd[8772]: LOCAL(1) 8014 84 reachable
> 26 Feb 15:03:40 ntpd[8772]: LOCAL(1) 901a 8a sys_peer <== bad
> 26 Feb 15:03:40 ntpd[8772]: 0.0.0.0 c515 05 clock_sync
> 26 Feb 15:22:25 ntpd[8772]: 10.0.7.6 f01a 8a sys_peer  <=== Good!
> 26 Feb 15:22:25 ntpd[8772]: 0.0.0.0 0613 03 spike_detect -0.370644 s
> 26 Feb 15:30:03 ntpd[8772]: 0.0.0.0 061c 0c clock_step -0.536289 s
> 26 Feb 15:30:02 ntpd[8772]: 0.0.0.0 0615 05 clock_sync
> 26 Feb 15:30:03 ntpd[8772]: 0.0.0.0 c618 08 no_sys_peer
> 26 Feb 15:30:03 ntpd[8772]: 10.0.7.6 e014 84 reachable
> 26 Feb 15:30:07 ntpd[8772]: LOCAL(1) 8014 84 reachable
> 26 Feb 15:30:21 ntpd[8772]: 10.0.7.6 f01a 8a sys_peer
> ...
> 26 Feb 15:46:49 ntpd[8772]: 0.0.0.0 c618 08 no_sys_peer
> 26 Feb 15:46:57 ntpd[8772]: 10.0.7.6 f01a 8a sys_peer
> 
> ...
> 26 Feb 15:56:58 ntpd[8772]: 10.0.7.6 f01a 8a sys_peer
> ...
> 26 Feb 16:24:33 ntpd[8772]: LOCAL(1) 901a 8a sys_peer <== and stays
LOCAL
> which is now normal for this box  :(
> 
> Should the jump to stratum 14 be expected?  Anything obviously wrong with
> the ntp.conf?
> 
> I've had a few days of testing on what is usually a very stable
(time-wise
> system), seems that running at prio 20 is required.
> 
> /etc/ntp.conf contains
> rlimit memlock -1
> rlimit filenum 32
> driftfile /var/db/ntp/drift
> disable bclient
> server 10.0.7.6 iburst minpoll 4 maxpoll 6 version 4 key 23057 prefer
> 
> server 127.127.1.1 minpoll 7 maxpoll 7
> fudge  127.127.1.1 stratum 14
> 
> restrict -4 default ignore
> restrict -6 default ignore
> restrict 127.0.0.1  nomodify nopeer notrap
> restrict -6 ::1     nomodify nopeer notrap
> restrict 0.0.0.0 ignore
> 
> restrict 10.0.7.6 nomodify nopeer noquery notrap ntpport
> restrict 10.169.168.91 mask 255.255.255.0 nomodify nopeer noquery notrap
> ntpport kod limited
> 
> 
> I'm also very surprised that the jitter on the server (under testing)
is so
> poor.  The internet facing time server is
> *x.y.z.t   .ATOM.           1 u   73  512    7   23.776   34.905  95.961
> but its very old and not running aslr.
> 
> Any ideas or pointers would be appreciated.  This is very, time consuming.
> :)
> 
> I'm using the following command sequence as these are all being changed
> sysctl kern.elf64.aslr.enable=1 kern.elf64.aslr.stack_gap=1
> security.mac.ntpd.enabled=1 && \
> /usr/bin/proccontrol -m aslr -s disable /usr/local/sbin/ntpdate -v -a
> 23057  -k /etc/ntp.keys 10.0.7.6 && sleep 2 && \
> /rescue/nice -n -20  /usr/bin/proccontrol -m aslr -s disable
>  /usr/local/sbin/ntpd -c /etc/ntp.conf -g -g  -u ntpd --nofork
> 
> I get similar results with /usr/sbin/ntpd, I've been testing both and
> happened to record details for the port ntpd.
> 
> Regards, Dewayne
> 
Using a local clock is a bad bad idea.  I'm not sure what problem you
think it solves, but I'm fairly sure that configuring local clocks is
at the root of your problems.  The only valid configuration that
includes a local clock is when some external mechanism other than ntp
is disciplining the kernel clock and ntpd is being used only to monitor
that performance and serve time to others.

Your configuration is almost the perfect setup for failure:  one
unreliable network clock, and the local clock.  Ntpd's strong point is
being able to select reliable servers from a collection of candidates. 
When you reduce its choices to just two, there is no way it can make a
correct choice.

The sequence you show above starts out with 10.0.7.6 sync'd to another
network server, with an offset of 227ms, making it a bad candidate as
system peer.  Two minutes later 10.0.7.6 has switched itself to its
local clock and dropped to stratum 14; ntpd is not going to choose a
remote stratum 14 server over the stratum 14 local clock.

Eventually 10.0.7.6 steps its clock and resumes operation at stratum 2.
Shortly after that, the problem system follows by stepping its clock
then switching the system peer to 10.0.7.6.

You're going to keep having these kinds of problems as long as local
clocks are configured.  Especially since multiple servers in your
syncronization net are configured that way, and don't have enough peers
to sanity-check the times.

-- Ian

Dewayne Geraghty

2020-Mar-05 20:49 UTC

head link

ntp problems stratum 2 to 14?

Ian,  Good points.  I did remove the fudge and 127.127.1.1 lines from the
config with the same result as below.  Interestingly the clock at 10.0.7.6
isn't really unreliable, as its been my time source since 2005, and serves
clients, so it is pretty ok. Without a local clock, named fails (its linked
with kerberos).  Yes I'd never seen my clock server become st 14, which
prompted me to seek help. I haven't repeated that scenario, but I
continuously get "interrupted system call" and I haven't been able
to sync
while running debug or ktrace - so I dont have "what good looks like".

Thanks Peter. I've rebuilt net/ntpd in various ways including all
defaults.  Rebuilt the kernel (& world), to the latest 12 stable.
Reset almost all sysctl's (I change 91 of them).  I keep getting
interrupted system call at 1 sec intervals, which I suspect is a problem

For the reader: a stratum 2 clocks 10.0.7.6 syncs with some atomic clocks
within city; a server 10.0.7.91 running ntpd 4.2.8p14 on FreeBSD 12.1
r358565M irregularly and usually wont sync, and experiences "interrupted
system calls".

The jump to stratum 14 was a surprise, but not repeatable.  Sometimes the
ntpd port starts and uses the next hop time server, but usually within 20
minutes returns to LOCL though more often goes straight to LOCL.  During
ktrace's I've observed:

 66894 ntpd     CALL  write(0x1,0x80078e000,0x2e)
 66894 ntpd     GIO   fd 1 wrote 46 bytes
       "select() returned -1: Interrupted system call
...
       "poll_update: at 1 10.0.7.6 poll 4 burst 0 retry 2 head 14 early 2
next 16
       "
 66894 ntpd     RET   write 74/0x4a
 66894 ntpd     CALL  select(0x19,0x7fffffffde50,0,0,0)
 66894 ntpd     RET   select -1 errno 4 Interrupted system call
 66894 ntpd     PSIG  SIGALRM caught handler=0x80072f600 mask=0x0
code=SI_KERNEL
 66894 ntpd     CALL  sigprocmask(SIG_SETMASK,0x7fffffffd7a4,0)
 66894 ntpd     RET   sigprocmask 0
 66894 ntpd     CALL  sigreturn(0x7fffffffd3d0)
 66894 ntpd     RET   sigreturn JUSTRETURN
 66894 ntpd     CALL  write(0x1,0x80078e000,0x2e)
 66894 ntpd     GIO   fd 1 wrote 46 bytes
       "select() returned -1: Interrupted system call
...
       "select() returned -1: Interrupted system call
       "
 66894 ntpd     RET   write 46/0x2e
 66894 ntpd     CALL  select(0x19,0x7fffffffde50,0,0,0)
 66894 ntpd     RET   select -1 errno 4 Interrupted system call
 66894 ntpd     PSIG  SIGALRM caught handler=0x80072f600 mask=0x0
code=SI_KERNEL
 66894 ntpd     CALL  sigprocmask(SIG_SETMASK,0x7fffffffd7a4,0)
 66894 ntpd     RET   sigprocmask 0
 66894 ntpd     CALL  sigreturn(0x7fffffffd3d0)
 66894 ntpd     RET   sigreturn JUSTRETURN
 66894 ntpd     CALL  write(0x1,0x80078e000,0x2e)
 66894 ntpd     GIO   fd 1 wrote 46 bytes
       "select() returned -1: Interrupted system call
but I have no idea whether or not these interrupted system calls are normal
or not.

and with debug (-D5) this is what I see

 5 Mar 18:26:50 ntpd[86274]: select(): nfound=-1, error: Interrupted system
call
alarming: normal
 5 Mar 18:26:51 ntpd[86274]: select(): nfound=-1, error: Interrupted system
call
poll_update: at 17 10.0.7.6 poll 4 burst 0 retry 0 head 0 early 2 next 16
sendpkt(21, dst=10.0.7.6, src=10.0.7.91, ttl=0, len=72)
peer_xmit: at 17 10.0.7.91->10.0.7.6 mode 3 keyid 0000232f len 72 index 0
read_network_packet: fd=21 length 72 from 10.0.7.6
fetch_timestamp: system bintime network time stamp: 1583393211.408612992
restrictions: looking up 10.0.7.6
match_restrict4_addr: Checking 127.0.0.1, port 123 ... doesn't match:
ippeerlimit -4
match_restrict4_addr: Checking 127.0.0.1, port 123 ... doesn't match:
ippeerlimit -1
match_restrict4_addr: Checking 10.0.7.91, port 123 ... doesn't match:
ippeerlimit -4
match_restrict4_addr: Checking 10.0.7.6, port 123 ... MATCH: ippeerlimit -1
receive: at 17 10.0.7.91<-10.0.7.6 ippeerlimit -1 mode 4 iflags
up,broadcast,multicast restrict nomodify,nopeer,noquery,notrap org
0xe20b283b.687ad7f6 xmt 0xe20b2838.2001aa5b
MRU: interval 16 headway 8 limit 64
receive: at 17 10.0.7.91<-10.0.7.6 mode 4/server:AM_PROCPKT keyid 0000232f
len 72 auth 1 org 0xe20b283b.687ad7f6 xmt 0xe20b2838.2001aa5b MAC
receive: MATCH_ASSOC dispatch: mode 4/server:AM_PROCPKT
poll_update: at 17 10.0.7.6 poll 4 burst 0 retry 0 head 14 early 2 next 16
clock_filter: n 2 off -3.283397 del 0.000382 dsp 3.937561 jit 0.000551
alarming: normal
 5 Mar 18:26:52 ntpd[86274]: select(): nfound=-1, error: Interrupted system
call
alarming: normal
 5 Mar 18:26:53 ntpd[86274]: select(): nfound=-1, error: Interrupted system
call
alarming: normal

I have rebuild with ntp-4.2.8p14 and with no additional compile rules and
the port's defaults. I keep getting

 6 Mar 07:05:16 ntpd[98682]: select(): nfound=-1, error: Interrupted system
call
alarming: normal
 6 Mar 07:05:17 ntpd[98682]: select(): nfound=-1, error: Interrupted system
call
alarming: normal
 6 Mar 07:05:18 ntpd[98682]: select(): nfound=-1, error: Interrupted system
call
alarming: normal
 6 Mar 07:05:19 ntpd[98682]: select(): nfound=-1, error: Interrupted system
call
alarming: normal
 6 Mar 07:05:20 ntpd[98682]: select(): nfound=-1, error: Interrupted system
call

So with compile, minimum settings and all aslr variables set to 0.  We get
the above result running as either ntpd or root.  I'm starting to wonder
about the network stack?

My build comprises:
# make __MAKE_CONF=/etc/make_test.conf -C /usr/ports/net/ntp -VCFLAGS
-VLDFLAGS
-O2 -pipe  -fstack-protector-strong -fno-strict-aliasing
  -fstack-protector
# make __MAKE_CONF=/etc/make_test.conf -C /usr/ports/net/ntp showconfig
|grep =on
     DEBUG=on: Build with debugging support
     DOCS=on: Build and/or install documentation
     EXAMPLES=on: Build and/or install examples
     IPV6=on: IPv6 protocol support
     LOCAL_CLOCK=on: Enable local clock reference
     NLS=on: Native Language Support
     PERL_UTILS=on: Install ntp-wait and ntptrace utilities
     SSL=on: SSL protocol support
     THREADS=on: Threading support

Also the base systems ntp 4.2.8p13 also experiences "interrupted system
calls"
 80370 ntpd     CALL  select(0x19,0x7fffffffde40,0,0,0)
 80370 ntpd     RET   select -1 errno 4 Interrupted system call
 80370 ntpd     PSIG  SIGALRM caught handler=0x801473610 mask=0x0
code=SI_KERNEL
 80370 ntpd     CALL  sigprocmask(SIG_SETMASK,0x7fffffffd7a4,0)
 80370 ntpd     RET   sigprocmask 0
 80370 ntpd     CALL  sigreturn(0x7fffffffd3d0)
 80370 ntpd     RET   sigreturn JUSTRETURN
 80370 ntpd     CALL  select(0x19,0x7fffffffde40,0,0,0)
 80370 ntpd     RET   select -1 errno 4 Interrupted system call
; perhaps just having aslr built into the kernel?? Its straw clutching
time...

I appreciate your consideration.  All suggestions welcome. :)
PS Ian the main time server 10.0.7.6 sync's off 2x atomic clocks, with 2x
GPS as failover ;)  The internal server 10.0.7.91 serves the internal
clients only.

freebsd stable - Mar 2020 - ntp problems stratum 2 to 14?

ntp problems stratum 2 to 14?

ntp problems stratum 2 to 14?