Upgrading from 10.2-RELEASE-p6 to 10.2-RELEASE-p7 has now solved the ntpd
crashes (apparently fixed by FreeBSD Errata Notice FreeBSD-EN-15:20.vm).
Thanks!!!
Mark
On 2015-11-01 10:31, Andre Albsmeier wrote:
> On Fri, 30-Oct-2015 at 19:47:59 +0100, Mark Martinec wrote:
>> Not sure if it's the same issue, but it sure looks like it is.
>>
>> I have upgraded a couple of hosts (amd64) from 10.2-RELEASE-p5
>> to 10.2-RELEASE-p6, i.e. freebsd-update essentially just
>> replaced the /usr/sbin/ntpd with a new one; then I restarted
>> the ntpd.
>>
>> On all hosts but one this was successful: the new ntpd starts
>> fine and works normally. But on one of these machines the
>> ntpd process immediately crashes with SIGSEGV. That machine
>> has an Intel Xeon cpu. It is not apparent to me in what way
>> this machine differs from others,
>
> I'll add my observations here:
>
> I am using an ntp.conf with a single server entry:
>
> server ntp.some.domain.org
>
> ntp.some.domain.org is a CNAME pointing to gate.some.domain.org
> and the latter contains an A record pointing to 192.168.128.1.
>
> After updating 9.3-STABLE to the latest version (one which includes ntp
> 4.2.8p4), ntpd crashes:
>
> Nov 1 09:38:38 voyager kernel: pid 4443 (ntpd), uid 0: exited on signal 11
>
> This happens in line 871 of ntpd.c where mlockall() is called:
>
> && 0 != mlockall(MCL_CURRENT|MCL_FUTURE))
>
> It does NOT crash with MCL_FUTURE only.
> It does crash with MCL_CURRENT only.
>
> When adding
>
> rlimit memlock -1
>
> to ntp.conf it does NOT crash (as mlockall() won't be called anymore).
>
> When specifying the IP address (192.168.128.1) as the server it
> does NOT crash.
>
> When specifying gate.some.domain.org as the server it also does
> NOT crash. tcpdump shows in this case:
>
> 09:49:59.542310 IP 192.168.128.2.21102 > 192.168.128.1.53: 7639+ A? gate.some.domain.org. (41)
> 09:49:59.542578 IP 192.168.128.1.53 > 192.168.128.2.21102: 7639* 1/1/0 A 192.168.128.1 (71)
> 09:49:59.542612 IP 192.168.128.2.52455 > 192.168.128.1.53: 42047+ AAAA? gate.some.domain.org. (41)
> 09:49:59.542792 IP 192.168.128.1.53 > 192.168.128.2.52455: 42047* 0/1/0 (88)
>
> When reverting the server entry back to ntp.some.domain.org
> it crashes and tcpdump shows:
>
> 09:36:05.172552 IP 192.168.128.2.17836 > 192.168.128.1.53: 49768+ A? ntp.some.domain.org. (40)
> 09:36:05.173320 IP 192.168.128.1.53 > 192.168.128.2.17836: 49768* 2/1/0 CNAME gate.some.domain.org., A 192.168.128.1 (89)
> 09:36:05.173361 IP 192.168.128.2.22611 > 192.168.128.1.53: 63808+ AAAA? ntp.some.domain.org. (40)
> 09:36:05.173595 IP 192.168.128.1.53 > 192.168.128.2.22611: 63808* 1/1/0 CNAME gate.some.domain.org. (106)
>
> The probability of crashing increases with the speed and the
> number of cores of the machine: on my old single-core Pentiums
> it never crashes; on my quad-core i7-3770K it always crashes.
>
> The (asynchronous) resolving of the names starts in line 3876 of
> ntp_config.c:
>
> getaddrinfo_sometime(curr_peer->addr->address,
>
> If we put the mlockall() call directly before this line, the
> crash is gone.
>
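For anyone who wants to experiment with the ordering Andre describes
(call mlockall() first, start the asynchronous lookup afterwards), here
is a minimal sketch in C. It is not ntpd code: a plain worker thread
stands in for whatever getaddrinfo_sometime() does internally, and the
helper names lock_process_memory(), resolve_in_thread() and
lock_then_resolve(), as well as the numeric service "123", are made up
for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <netdb.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: lock current and future pages, using the same
 * flags as the quoted ntpd.c line. Returns 0 on success, or the errno
 * value on failure (e.g. memlock limit too low, missing privilege). */
int lock_process_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return errno;
    return 0;
}

/* Thread body: a blocking lookup standing in for the asynchronous
 * resolver. The getaddrinfo() result is passed back as the thread's
 * return value. */
static void *resolve_worker(void *arg)
{
    const char *host = arg;
    struct addrinfo hints, *res = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_DGRAM;

    int rc = getaddrinfo(host, "123", &hints, &res);
    if (rc == 0)
        freeaddrinfo(res);
    return (void *)(intptr_t)rc;
}

/* Hypothetical helper: run the lookup in a worker thread and wait for
 * it. Returns the getaddrinfo() result (0 on success), or EAI_SYSTEM
 * if the thread could not be created or joined. */
int resolve_in_thread(const char *host)
{
    pthread_t tid;
    void *result = NULL;

    assert(host != NULL);
    if (pthread_create(&tid, NULL, resolve_worker, (void *)host) != 0)
        return EAI_SYSTEM;
    if (pthread_join(tid, &result) != 0)
        return EAI_SYSTEM;
    return (int)(intptr_t)result;
}

/* The ordering under test: lock memory first, start the lookup second.
 * A failed lock is reported through *lock_err but does not stop the
 * lookup. */
int lock_then_resolve(const char *host, int *lock_err)
{
    int err = lock_process_memory();

    if (lock_err != NULL)
        *lock_err = err;
    return resolve_in_thread(host);
}
```

Calling lock_then_resolve("ntp.some.domain.org", &err) from a small
main() would exercise the "lock first" order; swapping the two calls
gives the opposite order for comparison. Keep in mind that once
MCL_FUTURE is in effect and the memlock limit is low, later allocations
(a thread stack, for instance) can fail, so a failed lookup in this
sketch does not by itself say anything about the ntpd crash.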
> Maybe you want to play around with rlimit, CNAMEs, IPs and
> so on...
>
> -Andre
>
> Anyone else seeing this?
>> On 2015-10-30 12:34, David Wolfskill wrote:
>> > On Fri, Oct 30, 2015 at 09:42:07AM +0100, Dag-Erling Smørgrav wrote:
>> >> David Wolfskill <david at catwhisker.org> writes:
>> >> > ...
>> >> > bound to 172.17.1.245 -- renewal in 43200 seconds.
>> >> > pid 544 (ntpd), uid 0: exited on signal 11 (core dumped)
>> >> > Starting Network: lo0 em0 iwn0 lagg0.
>> >> > ...
>> >>
>> >> Did you find a solution? I'm wondering if the ntpd problems people
>> >> are reporting on freebsd-security@ are related. I vaguely recall
>> >> hearing that this had been traced to a pthread bug, but can't find
>> >> anything about it in commit logs or mailing list archives.
>> >> ....
>> >
>> > I don't recall finding "a solution" per se; that said, I also don't
>> > recall seeing an occurrence of the above for enough time that I'm not
>> > sure when I sent that message. :-}
>> >
>> > As a reality check:
>> >
>> > g1-252(11.0-C)[1] ls -lT /*.core
>> > -rw-r--r-- 1 root wheel 13783040 Aug 18 04:19:03 2015 /ntpd.core
>> > g1-252(11.0-C)[2]
>> >
>> > So -- among other points -- my last sighting of whatever was causing
>> > that was the day I built:
>> >
>> > FreeBSD 11.0-CURRENT #157 r286880M/286880:1100079: Tue Aug 18
>> > 04:45:25 PDT 2015
>> > root at g1-252.catwhisker.org:/common/S4/obj/usr/src/sys/CANARY amd64
>> >
>> > Note that the machines where I run head get updated daily (unless
>> > there's enough of a problem with head that I can't build it or can't
>> > boot it (and I'm unable to circumvent the issue within a reasonable
>> > time)) -- and while I do attempt to run ntpd on the machines, the above
>> > failure is more "annoying" than "crippling" in my particular case.
>> >
>> > And I'm presently running:
>> >
>> > FreeBSD 11.0-CURRENT #227 r290138M/290138:1100084: Thu Oct 29
>> > 05:12:58 PDT 2015
>> > root at g1-252.catwhisker.org:/common/S4/obj/usr/src/sys/CANARY amd64
>> >
>> > and building head @r290190 as I type.
>> >
>> > And FWIW, I *suspect* that one of the issues involved (in my case)
>> > was a ... lack of determinism ... in events involving getting the
>> > (wireless) network connectivity into a usable state as part of the
>> > initial transition to multi-user mode. (I only have evidence at
>> > the moment of the issue on my laptop; my build machine, which only
>> > uses a wired NIC, has no /ntpd.core file. It and my laptop are updated
>> > pretty much in lock-step; it runs a completely GENERIC kernel, while
>> > the laptop runs a modestly customized one based on GENERIC.)
>> >
>> > Peace,
>> > david
>> _______________________________________________
>> freebsd-stable at freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
>> To unsubscribe, send any mail to
>> "freebsd-stable-unsubscribe at freebsd.org"