On Tue, Nov 1, 2016 at 2:36 PM, Jason Harmening <jason.harmening at gmail.com> wrote:
> Sorry, that should be ~*30ms* to get 30fps, though the variance is still
> up to 500ms for me either way.
>
> On 11/01/16 14:29, Jason Harmening wrote:
> > repro code is at http://pastebin.com/B68N4AFY if anyone's interested.
> >
> > On 11/01/16 13:58, Jason Harmening wrote:
> >> Hi everyone,
> >>
> >> I recently upgraded my main amd64 server from 10.3-stable (r302011) to
> >> 11.0-stable (r308099).  It went smoothly except for one big issue:
> >> certain applications (but not the system as a whole) respond very
> >> sluggishly, and video playback of any kind is extremely choppy.
> >>
> >> The system is under very light load, and I see no evidence of abnormal
> >> interrupt latency or interrupt load.  More interestingly, if I place
> >> the system under full load (~0.0% idle) the problem *disappears* and
> >> playback/responsiveness are smooth and quick.
> >>
> >> Running ktrace on some of the affected apps points me at the problem:
> >> huge variance in the amount of time spent in the nanosleep system call.
> >> A sleep of, say, 5ms might take anywhere from 5ms to ~500ms from entry
> >> to return of the syscall.  OTOH, anything CPU-bound or that waits on
> >> condvars or I/O interrupts seems to work fine, so this doesn't seem to
> >> be an issue with overall system latency.
> >>
> >> I can repro this with a simple program that just does a 3ms usleep in
> >> a tight loop (i.e. roughly the amount of time a video player would
> >> sleep between frames @ 30fps).  At light load ktrace will show the
> >> huge nanosleep variance; under heavy load every nanosleep will
> >> complete in almost exactly 3ms.
> >>
> >> FWIW, I don't see this on -current, although right now all my -current
> >> images are VMs on different HW, so that might not mean anything.  I'm
> >> not aware of any recent timer- or scheduler-specific changes, so I'm
> >> wondering if perhaps the recent IPI or taskqueue changes might be
> >> somehow to blame.
> >>
> >> I'm not especially familiar w/ the relevant parts of the kernel, so
> >> any guidance on where I should focus my debugging efforts would be
> >> much appreciated.
> >>
> >> Thanks,
> >> Jason

This is likely off track, but it is behavior I have noticed since moving to 11, though it might have started in 10.3-STABLE before I moved to head, prior to 11 going to beta. I can't explain how nanosleep could be involved, but I saw annoying lock-ups similar to yours, and I also no longer see them.

I eliminated the annoyance by changing the scheduler from ULE to 4BSD. That was it; I have not seen the issue since. I'd be very interested in whether the scheduler is somehow impacting timing functions or whether it's a different issue. I've felt that there was something off in ULE for some time, but it was these annoying hiccups that finally convinced me to try going back to 4BSD.

Tip o' the hat to Doug B. for his suggestion that ULE may have issues that impact interactivity.
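[For reference, in case the pastebin does not outlive the archive: the repro described above is just a short sleep in a tight loop with timing around each call. A minimal sketch along those lines -- a reconstruction from the description in the thread, not Jason's actual code, and the 10ms reporting threshold is an arbitrary choice:

    /*
     * Sleep ~3ms in a tight loop and report how long each nanosleep
     * actually took.  Under the problem described above, individual
     * sleeps on a near-idle system can take hundreds of ms.
     */
    #include <stdio.h>
    #include <time.h>

    int
    main(void)
    {
            struct timespec req = { 0, 3000000 };   /* 3 ms */
            struct timespec t0, t1;

            for (;;) {
                    clock_gettime(CLOCK_MONOTONIC, &t0);
                    nanosleep(&req, NULL);
                    clock_gettime(CLOCK_MONOTONIC, &t1);
                    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                        (t1.tv_nsec - t0.tv_nsec) / 1e6;
                    if (ms > 10.0)  /* flag sleeps that overshoot badly */
                            printf("nanosleep(3ms) took %.1f ms\n", ms);
            }
    }

Run it on a mostly idle system; a steady stream of output indicates the problem, while under full load the overshoots should disappear.]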
On 11/01/16 20:45, Kevin Oberman wrote:
> On Tue, Nov 1, 2016 at 2:36 PM, Jason Harmening
> <jason.harmening at gmail.com> wrote:
>
>     Sorry, that should be ~*30ms* to get 30fps, though the variance is
>     still up to 500ms for me either way.
>
>     [...]
>
> I eliminated the annoyance by changing the scheduler from ULE to 4BSD.
> That was it; I have not seen the issue since. I'd be very interested in
> whether the scheduler is somehow impacting timing functions or whether
> it's a different issue. I've felt that there was something off in ULE
> for some time, but it was these annoying hiccups that finally convinced
> me to try going back to 4BSD.
>
> Tip o' the hat to Doug B. for his suggestion that ULE may have issues
> that impact interactivity.

I figured it out: r282678 (which was never MFCed to 10-stable) added support for the MWAIT instruction on the idle path for Intel CPUs that claim to support it. While my CPU (a 2009-era Xeon 5500) advertises support for it in its feature mask and ACPI C-state entries, the cores don't seem to respond very quickly to interrupts while idling in MWAIT.
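[If you want to check whether a given CPU advertises MONITOR/MWAIT at all -- the feature bit that r282678 keys off, together with the ACPI C-state entries -- it is CPUID leaf 1, ECX bit 3. A quick userland check; this is my own sketch, not code from the thread:

    /* Report the MONITOR/MWAIT feature bit: CPUID.01H:ECX[3]. */
    #include <stdio.h>
    #include <cpuid.h>

    int
    main(void)
    {
            unsigned int eax, ebx, ecx, edx;

            if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
                    fprintf(stderr, "CPUID leaf 1 unavailable\n");
                    return (1);
            }
            printf("MONITOR/MWAIT: %s\n",
                (ecx & (1U << 3)) ? "advertised" : "not advertised");
            return (0);
    }

Note that the bit being set says nothing about wakeup latency, which is exactly the issue here: the Xeon 5500 sets the bit but wakes slowly.]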
Disabling mwait in acpi_cpu.c and falling back to the old "sti; hlt" mechanism for C1 completely fixes the responsiveness issues. So if your CPU is of a similar vintage, it may not be ULE's fault.
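[For context, the two C1 entry mechanisms differ roughly as follows. This is an illustrative sketch of the two instruction sequences, modeled on but not copied from the kernel's acpi_cpu idle path, and of course only executable at ring 0:

    /*
     * Old mechanism: enable interrupts, then halt.  The core wakes on
     * the next interrupt; hlt wakeup latency is historically predictable.
     */
    static inline void
    c1_idle_hlt(void)
    {
            __asm__ __volatile__("sti; hlt");
    }

    /*
     * r282678-style mechanism: arm a monitored address, then mwait with
     * a C1 hint.  The core wakes on an interrupt or on a store to the
     * monitored line.  On some older parts (e.g. the Xeon 5500 above)
     * this wakeup appears to be much slower.
     */
    static inline void
    c1_idle_mwait(volatile int *monitor_word)
    {
            /* monitor: address in EAX/RAX, extensions in ECX, hints in EDX */
            __asm__ __volatile__("monitor"
                : : "a" (monitor_word), "c" (0), "d" (0));
            /* mwait: hints in EAX (0 = C1), extensions in ECX */
            __asm__ __volatile__("sti; mwait" : : "a" (0), "c" (0));
    }

Depending on your kernel version, the machdep.idle sysctl may let you fall back to hlt without patching acpi_cpu.c; check machdep.idle_available on your system before relying on that.]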
On 11/01/16 23:45, Kevin Oberman wrote:
> On Tue, Nov 1, 2016 at 2:36 PM, Jason Harmening
> <jason.harmening at gmail.com> wrote:
>
>> Sorry, that should be ~*30ms* to get 30fps, though the variance is still
>> up to 500ms for me either way.
>>
>> On 11/01/16 14:29, Jason Harmening wrote:
>>> repro code is at http://pastebin.com/B68N4AFY if anyone's interested.
>>>
>>> On 11/01/16 13:58, Jason Harmening wrote:
>>>> Hi everyone,
>>>>
>>>> I recently upgraded my main amd64 server from 10.3-stable (r302011) to
>>>> 11.0-stable (r308099).  It went smoothly except for one big issue:
>>>> certain applications (but not the system as a whole) respond very
>>>> sluggishly, and video playback of any kind is extremely choppy.
>>>>
>>>> [...]
>
> I eliminated the annoyance by changing the scheduler from ULE to 4BSD.
> That was it; I have not seen the issue since. I'd be very interested in
> whether the scheduler is somehow impacting timing functions or whether
> it's a different issue. I've felt that there was something off in ULE
> for some time, but it was these annoying hiccups that finally convinced
> me to try going back to 4BSD.
>
> Tip o' the hat to Doug B. for his suggestion that ULE may have issues
> that impact interactivity.
> [...]

Not to beat a dead horse, but I've been a non-fan of SCHED_ULE since it was first introduced, and I don't like it even today. I run the distributed.net client on my machines, but even without that, ULE screws up interactive behavior. With distributed.net running under ULE, a make buildworld/make buildkernel takes 10 2/3 hours on 10.3-RELEASE on a 6-CPU machine, versus 2 1/2 hours on the same machine with 4BSD and distributed.net running.

I'm told that SCHED_ULE is the greatest thing since sliced bread for some compute load or other (details are scarce), but I (fortunately) don't often have to run heavy server-type loads, and for everyday use (even without distributed.net running), SCHED_4BSD is my choice by far. It's too bad I can't run freebsd-update with it, though, since switching schedulers means building a custom kernel (see the note below).

I promise to shut up about this now.

-- George
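[For anyone who wants to run the same experiment: switching to SCHED_4BSD means building a custom kernel, since GENERIC is built with SCHED_ULE. A minimal sketch -- the config name MYKERNEL is arbitrary:

    # /usr/src/sys/amd64/conf/MYKERNEL (hypothetical name)
    include         GENERIC
    ident           MYKERNEL
    nooptions       SCHED_ULE       # drop the default scheduler
    options         SCHED_4BSD      # traditional 4BSD scheduler

followed by the usual rebuild:

    cd /usr/src
    make buildkernel KERNCONF=MYKERNEL
    make installkernel KERNCONF=MYKERNEL

This is also why freebsd-update stops being an option for kernel updates: it only handles the stock GENERIC kernel.]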