thr3ads.net - freebsd stable - Appendices - more data: SCHED_ULE+PREEMPTION is the problem [Apr 2018]

If this information is useful, please help other people find it:
Share via:

Stefan Esser

2018-Apr-09 10:07 UTC

more data: SCHED_ULE+PREEMPTION is the problem

Am 07.04.18 um 16:18 schrieb Peter:> 3. kern.sched.preempt_thresh
> 
> I could make the problem disappear by changing kern.sched.preempt_thresh
?from
> the default 80 to either 11 (i5-3570T) or 7 (p3) or smaller. This seems to
> correspond to the disk interrupt threads, which run at intr:12 (i5-3570T)
or
> intr:8 (p3).
[CC added to include Jeff as the author of the ULE scheduler ...]

Since I had somewhat similar problems on my systems (with 4 Quad-Core with SMT
enabled, i.e. 8 threads of execution) with compute bound processes keeping I/O
intensive processes from running (load average of 12 with 8 "CPUs"),
and these
problems where affected by preempt_thresh, I checked how this variable is used
in the scheduler. The code is in /sys/kern/sched_ule.c.

It controls, whether a thread that has become runnable (e.g., after waiting
for disk I/O to complete) will preempt the thread currently running on
"this"
CPU (i.e. the one executing this test in the kernel).

IMHO, sched_preempt should default to a much higher number than 80 (e.g. 190),
but maybe I misunderstand some of the details ...


static inline int
sched_shouldpreempt(int pri, int cpri, int remote)
{

The parameters are:

	pri: the priority if the now runnable thread
	cpri: the priority of the thread that currently runs on "this" CPU
	remote: whether to consider preempting a thread on another CPU

The priority values are those displayed by top or ps -l as "PRI", but
with an
offset of 100 applied (i.e. pri=120 is displayed as PRI=20 in top).

If this thread has less priority than the currently executing one (cpri), the
currently running thread will not be preempted:

    /*


     * If the new priority is not better than the current priority there is


     * nothing to do.


     */
    if (pri >= cpri)
        return (0);

If the current thread is the idle thread, it will always be preempted by the
now runnable thread:

    /*


     * Always preempt idle.


     */
    if (cpri >= PRI_MIN_IDLE)
        return (1);

A value of preempt_thresh=0 (e.g. if "options PREEMPTION" is missing
in the
kernel config) lets the previously running thread continue (except if was the
idle thread, which has been dealt with above). The compute bound thread may
continue until its quantum has expired.

    /*


     * If preemption is disabled don't preempt others.


     */
    if (preempt_thresh == 0)
        return (0);

For any other value of preempt_thresh, the new priority of the thread that
just has become runnable will be compared to preempt_thresh and if this new
priority is higher (lower numeric value) or equal to preempt_thresh, the
thread for which (e.g.) disk I/O finished will preempt the current thread:

    /*


     * Preempt if we exceed the threshold.


     */
    if (pri <= preempt_thresh)
        return (1);

===> This is the only condition that depends on preempt_thresh > 0 <==
The flag "remote" controls whether this thread will be scheduled to
run, if
its priority is higher or equal to PRI_MAX_INTERACT (less than or equal to
151) and if the opposite is true for the currently running thread (cpri).
The value of remote will always be 0 on kernels built without "options
SMP".

    /*


     * If we're interactive or better and there is non-interactive


     * or worse running preempt only remote processors.


     */
    if (remote && pri <= PRI_MAX_INTERACT && cpri >
PRI_MAX_INTERACT)
        return (1);


The critical use of preempt_thresh is marked above. If it is 0, no preemption
will occur. On a single processor system, this should allow the CPU bound
thread to run for as long its quantum lasts.

A value of 120 (corresponding to PRI=20 in top) will allow the I/O bound
thread to preempt any other thread with lower priority (cpri > pri). But in
case of a high priority kernel thread being active during this test (with a
low numeric cpri value), the I/O bound process will not preempt that higher
priority thread (i.e. some high priority kernel thread).

Whether the I/O bound thread will run (instead of the compute bound) after
the higher priority thread has given up the CPU, will depend on the scheduler
decision which thread to select. And for "timeshare" threads, this
will often
not be the higher priority (I/O bound) thread, but the compute bound thread,
which then may execute until next being interrupted by the I/O bound thread
(which will not happen, if no new I/O has been requested).

This might explain, why setting preempt_thresh to a very low value (in the
range of real-time kernel threads) enforces preemption of the CPU bound
thread, while any higher (numeric) value of preempt_thresh prevents this
and makes tdq_choose() often select the low priority CPU bound over the
higher priority I/O bound thread.

BUT the first test in sched_shouldpreempt() should prevent any user process
from ever preempting a real-time thread "if (pri >= cpri) return
0;".

For preemption to occur,  pri must be numerically lower than cpri, and
pri must be numerically lower than or equal to preempt_thresh.
> a. with kern.sched.preempt_thresh=80
> 
> $ lz4 DATABASE_TEST_FILE /dev/null & while true;
> ? do ps -o pid,pri,"%cpu",command -p 2119,$!
> ? sleep 3
> done
> [1] 6073
[...]> ?PID PRI %CPU COMMAND
> 6073? 52? 6.5 lz4 DATABASE_TEST_FILE /dev/null
> 2119? 99 91.5 -bash (bash)
The I/O bound thread does not preempt the compute bound thread, when becoming
runnable (data arrived from disk).

With the value of preempt_thresh=80 (corresponding to PRI=-20) only real-time
threads may cause preemption, the I/O bound thread can not (PRI=52 / pri=152).

A value of preempt_thresh in the range of 190 (corresponding to PRI=90) should
allow the lz4 process to preempt the CPU bound process (with higher pri/PRI).
> b. with kern.sched.preempt_thresh=11
> 
> ?PID PRI %CPU COMMAND
> 4920? 21? 0.0 lz4 DATABASE_TEST_FILE /dev/null
> 2119 101 93.5 -bash (bash)
[...]> ?PID PRI %CPU COMMAND
> 4920? 85 43.0 lz4 DATABASE_TEST_FILE /dev/null
> 2119? 85 45.5 -bash (bash)
Such a low preempt_thresh does not allow any user process to preempt any other
one (except when running with temporarily increased priority in the kernel).

Only a kernel thread (soft interrupt?) at might cause preemption, and if the
interrupt is due to a read issued by the I/O bound thread completing, then the
I/O bound process is not the one being preempted. This will make the timeshare
scheduler select the process with higher priority (lower PRI) that did not
recently run (i.e. the I/O bound process, if both have the same PRI), when the
kernel thread goes to sleep.

But (if my analysis is correct) this indicates, that preempt_thresh set to an
extremely low value just helps by accident. The kernel thread interrupts the
CPU bound thread, and the I/O bound thread is selected as the next runnable
thread in the time-share run queue, either because of its lower PRI value or
because it did not run last before the preemption occurred (with equal PRI
for both).

But, in fact, the same scheduler selection should have occured in test (a),
too, if e.g. a soft interrupt preempts the compute bound thread. Not sure,
why this does not happen ... (And this may be an indication, that I do not
fully understand what's going on ;-) ...)
> From this we can see that in case b. both processes balance out nicely and
> meet at equal CPU shares.
> Whereas in case a., after about 10 Seconds (the first 3 records) they move
to
> opposite ends of the scale and stay there.
> 
> From this I might suppose that here is some kind of mis-calculation or
> mis-adjustment of the task priorities happening.
I'd be interested in your results with preempt_thresh set to a value of e.g.
190. The PRI=85 values in your test case (b) correspond to pri=185, and with
preempt_thresh slightly higher that that, the lz4 process should still get a
50% share of the CPU. (If its PRI grows over that of the CPU bound process,
it will not be able to preempt it, so its PRI should match the one of the CPU
bound process).

Regards, STefan

Peter

2018-Apr-10 11:22 UTC

head link

more data: SCHED_ULE+PREEMPTION is the problem

Hi Stefan,

  I'm glad to see You're thinking along similar paths as I did. But let 
me fist answer Your question straight away, and sort out the remainder
afterwards.

 > I'd be interested in your results with preempt_thresh set to a value
 > of e.g.190.

There is no difference. Any value above 7 shows the problem identically.

I think this value (or preemtion as a whole) is not the actual cause for 
the problem; it just changes some conditions that make the problem 
visible. So, trying to adjust preempt_thresh in order to fix the
problem seems to be a dead end.

Stefan Esser wrote:
> The critical use of preempt_thresh is marked above. If it is 0, no
preemption
> will occur. On a single processor system, this should allow the CPU bound
> thread to run for as long its quantum lasts.
I would like to contradict here.

 From what I understand, preemption is *not* the base of task switching.
AFAIK preemption is an additional feature that allows to switch threads
while they execute in kernel mode. While executing in user mode, a 
thread can be interrupted and switched at any time, and that is how
the traditional time-sharing systems did it. Traditionally a thread
would execute in kernel mode only during interrupts and syscalls, and
those last no longer than a few ms, and for long that was not an issue. 
Only when we got the fast interfaces (10Gbps etc.) and got big monsters 
executing in kernel space (traffic-shaper, ZFS, etc.), that scheme 
became problematic and preemption was invented.

According to McKusicks book, the scheduler is two-fold: an outer logic
runs few times per second and calculates priorities. And an inner logic
runs very often (at every interrupt?) and chooses the next runnable 
thread simply by priority.
The meaning of the quantum is then: when it is used up, the thread is 
moved to the end of it's queue, so that it may take a while until it 
runs again. This is for implementing round-robin behaviour within a
single queue (= a single priority). It should not prevent task-switching 
as such.

Lets have a look. sched_choose() seems to be that low-level scheduler 
function that decides which thread to run next. Lets create a log of its 
decisions.[1]

With preempt_thresh >= 12 (kernel threads left out):

  PID                COMMAND TIMESTAMP
  18196                 bash 1192.549
  18196                 bash 1192.554
  18196                 bash 1192.559
  66683                  lz4 1192.560
  18196                 bash 1192.560
  18196                 bash 1192.562
  18196                 bash 1192.563
  18196                 bash 1192.564
  79496                 ntpd 1192.569
  18196                 bash 1192.569
  18196                 bash 1192.574
  18196                 bash 1192.579
  18196                 bash 1192.584
  18196                 bash 1192.588
  18196                 bash 1192.589
  18196                 bash 1192.594
  18196                 bash 1192.599
  18196                 bash 1192.604
  18196                 bash 1192.609
  18196                 bash 1192.613
  18196                 bash 1192.614
  18196                 bash 1192.619
  18196                 bash 1192.624
  18196                 bash 1192.629
  18196                 bash 1192.634
  18196                 bash 1192.638
  18196                 bash 1192.639
  18196                 bash 1192.644
  18196                 bash 1192.649
  18196                 bash 1192.654
  66683                  lz4 1192.654
  18196                 bash 1192.655
  18196                 bash 1192.655
  18196                 bash 1192.659

The worker is indeed called only after 95ms.

And with preempt_thresh < 8:

  PID                COMMAND TIMESTAMP

  18196                 bash 1268.955
  66683                  lz4 1268.956
  18196                 bash 1268.956
  66683                  lz4 1268.956
  18196                 bash 1268.957
  66683                  lz4 1268.957
  18196                 bash 1268.957
  66683                  lz4 1268.958
  18196                 bash 1268.958
  66683                  lz4 1268.959
  18196                 bash 1268.959
  66683                  lz4 1268.959
  18196                 bash 1268.960
  66683                  lz4 1268.960
  18196                 bash 1268.961
  66683                  lz4 1268.961
  18196                 bash 1268.961
  66683                  lz4 1268.962
  18196                 bash 1268.962

Here we have 3 Csw per millisecond. (The fact that the decisions are 
over-all more frequent is easily explained: when lz4 gets to run, it 
will do disk I/O, which quickly returns and triggers new decisions.)

In the second record, things are clear: while lz4 does disk I/O, the 
scheduler MUST run bash, because nothing else is there. But when data 
arrives, it runs again lz4.
But in the first record - why does the scheduler choose bash, although
lz4 has already much higher prio (52 versus 97, usually)?
> A value of 120 (corresponding to PRI=20 in top) will allow the I/O bound
> thread to preempt any other thread with lower priority (cpri > pri). But
in
> case of a high priority kernel thread being active during this test (with a
> low numeric cpri value), the I/O bound process will not preempt that higher
> priority thread (i.e. some high priority kernel thread).
>
> Whether the I/O bound thread will run (instead of the compute bound) after
> the higher priority thread has given up the CPU, will depend on the
scheduler
> decision which thread to select. And for "timeshare" threads,
this will often
> not be the higher priority (I/O bound) thread, but the compute bound
thread,
> which then may execute until next being interrupted by the I/O bound thread
> (which will not happen, if no new I/O has been requested).
Exactly. But why does this happen?

Following the trail onwards from sched_choose() we find tdq_choose(), 
and that is quite straightforward looking for something runnable.
The realtime and idle queues are simple and should grab the thread with 
highest priority, but the timeshare queue (which is the interesting 
piece here) has a special gimmick.
It calls kern_sched.c:runq_choose_from() with an extra parameter 
tdq_ridx ("Current removal index").

 From the design papers I get the clue that the heads of the timeshare 
runqueues are arranged in some kind of circular buffer - this is a 
specialty of SCHED_ULE; SCHED_4BSD does not have this feature, and does 
not call runq_choose_from(). And the tdq_ridx is the current index onto 
that circle.

This mechanism (which I do not really understand currently) is the AFAIK 
only "element of distortion", which may cause that *not* the highest 
prio thread is selected to run.
> For preemption to occur,  pri must be numerically lower than cpri, and
> pri must be numerically lower than or equal to preempt_thresh.
Maybe. But as I wrote above, preemption is only concerning threads 
running in kernel mode. And what I want, is simply a normal task 
switching between the two processes running in user mode.
> But, in fact, the same scheduler selection should have occured in test (a),
> too, if e.g. a soft interrupt preempts the compute bound thread. Not sure,
> why this does not happen ... (And this may be an indication, that I do not
> fully understand what's going on ;-) ...)
I think You do understand it - or we both don't understand it ;) - and 
there is still something else going on here which we do not yet see.
> The PRI=85 values in your test case (b) correspond to pri=185, and with
> preempt_thresh slightly higher that that, the lz4 process should still get
a
> 50% share of the CPU. (If its PRI grows over that of the CPU bound process,
> it will not be able to preempt it, so its PRI should match the one of the
CPU
> bound process).
As mentioned above, this does not work in practice.

What I did instead was take a look at this circular index tdq_ridx.

And what I found there is very strange:
That index indeed moves from 0 to 63 (we have only one runq every 4 
priorities) and repeats, and it takes 1/2 second to get around.

Except when the problem appears - then it suddenly takes more than
six seconds to get around!

tdq_ridx is defined within sched_ule.c, and it is modified only here:

 >         * Advance the insert index once for each tick to ensure that all
 >         * threads get a chance to run.
 >         */
 >        if (tdq->tdq_idx == tdq->tdq_ridx) {
 >                tdq->tdq_idx = (tdq->tdq_idx + 1) % RQ_NQS;
 >                if 
(TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
 >                        tdq->tdq_ridx = tdq->tdq_idx;

The comment suggests that tdq_ridx should increment with timer ticks, 
i.e. with constant speed! And, if the clock is stathz, that does figure:
64 / stathz = 0.5 seconds

Here are the logs [2] showing that indeed the speed changes greatly, but 
only in the situation when the problem appears:

preempt_thresh = 0:
           tdq_ridx    timestamp
                 0     32.948718044
                 0     33.453705474
                 0     33.998990897
                 0     34.503690368
                 0     35.003946192
                 0     35.508674863
                 0     36.016583301
                 0     36.518663899

preempt_thresh = 80:
                 0     69.056754055
                 0     69.561710522
                 0     70.066654974
                 0     70.571848226
                 0     71.076660214
                 0     71.576621653
                 0     72.081640256
                 0     72.586606139

piglet running: (and adding half-round 32 for proof)
                 0     32.100767726
                32     32.369727157
                 0     32.619581572
                32     32.965831183
                 0     33.220777200
                32     33.480751369
                 0     33.730745086
                32     33.980736953
                 0     34.235749995
                32     34.494315832
                 0     34.744162342
                32     35.050767367
                 0     35.315405397
                32     35.560715408
                 0     35.820902945
                32     36.068976384

piglet and worker running: now clock is slow!
                32     67.164163127
                 0     70.368819114
                32     73.581467022
                 0     76.817670952
                32     77.195794807
                 0     77.465528087
                32     77.778554940
                 0     78.644335854
                32     82.045835390
                 0     85.258399007
                32     88.478827844
                 0     91.746418960
                32     95.226730336

preempt_thresh back to 0:
                32     47.988086885
                 0     48.255774908
                32     48.507861276
                 0     48.765327585
                32     49.011779231
                 0     49.268056274
                32     49.515563800
                 0     49.815179616
                32     50.067504569
                 0     50.322943146
                32     50.572696557
                 0     50.823302445
                32     51.074755232

I do currently not see how (or if) this would lead to the suboptimal 
selection of a runnable thread - but it might indeed do so. And, if 
there is really a problem with the clock, this may also have further 
effects.

But then I have no clue what to make of this, or where to track it 
further. It might be that there is no problem with the scheduler, but 
instead the happening preemption does something bad with the clock.
Or whatever...

P.

Peter

2018-Apr-10 11:26 UTC

head link

Appendices - more data: SCHED_ULE+PREEMPTION is the problem

I forgot to attach the commands used to create the logs - they are ugly 
anyway:

[1]
dtrace -q -n '::sched_choose:return { @[((struct thread 
*)arg1)->td_proc->p_pid, stringof(((struct thread 
*)arg1)->td_proc->p_comm), timestamp] = count(); } tick-1s { exit(0);
}'
| sort -nk 3 | awk '$1 > 27 {$3 = ($3/1000000)*1.0/1000; printf "%6d
%20s %3.3f\n", $1, $2, $3 }'

[2]
dtrace -q -n '::runq_choose_from:entry /arg1 == 0||arg1 == 32/ { @[arg1, 
timestamp] = count(); }' | sort -nk2

freebsd stable - Apr 2018 - Appendices - more data: SCHED_ULE+PREEMPTION is the problem

more data: SCHED_ULE+PREEMPTION is the problem

more data: SCHED_ULE+PREEMPTION is the problem

Appendices - more data: SCHED_ULE+PREEMPTION is the problem