thr3ads.net - freebsd stable - more data: SCHED_ULE+PREEMPTION is the problem [Apr 2018]

If this information is useful, please help other people find it:
Share via:

Peter

2018-Apr-07 14:18 UTC

more data: SCHED_ULE+PREEMPTION is the problem (was: kern.sched.quantum: Creepy, sadistic scheduler

Hi all,
  in the meantime I did some tests and found the following:


A. The Problem:
---------------
On a single CPU, there are -exactly- two processes runnable:
One is doing mostly compute without I/O - this can be a compressing job 
or similar; in the tests I used simply an endless-loop. Lets call this 
the "piglet".
The other is doing frequent file reads, but also some compute interim -
this can be a backup job traversing the FS, or a postgres VACUUM, or 
some fast compressor like lz4. Lets call this the "worker".

It then happens that the piglet gets 99% CPU, while the worker gets only 
0.5% CPU and makes nearly no progress at all.

Investigations shows that the worker makes precisely one I/O per 
timeslice (timeslice as defined in kern.sched.quantum) - or two I/O on a 
mirrored ZFS.


B. Findings:
------------
1. Filesystem

I could never reproduce this when reading from plain UFS. Only when 
reading from ZFS (direct or via l2arc).

2. Machine

The problem originally appeared on a pentium3 at 1GHz. I was able to 
reproduce it on an i5-3570T, given the following measures:
  * config in BIOS to use only one CPU
  * reduce speed: "dev.cpu.0.freq=200"
I did see the problem also when running full speed (which means it 
happens there also), but could not reproduce it well.

3. kern.sched.preempt_thresh

I could make the problem disappear by changing kern.sched.preempt_thresh 
  from the default 80 to either 11 (i5-3570T) or 7 (p3) or smaller. This 
seems to correspond to the disk interrupt threads, which run at intr:12 
(i5-3570T) or intr:8 (p3).

4. dynamic behaviour

Here the piglet is already running as PID=2119. Then we can watch the 
dynamic behaviour as follows (on i5-3570T at 200MHz):

a. with kern.sched.preempt_thresh=80

$ lz4 DATABASE_TEST_FILE /dev/null & while true; 

   do ps -o pid,pri,"%cpu",command -p 2119,$! 

   sleep 3 

done 

[1] 6073 

  PID PRI %CPU COMMAND 

6073  20  0.0 lz4 DATABASE_TEST_FILE /dev/null 

2119 100 91.0 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  76 15.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  95 74.5 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  52 19.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  94 71.5 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  52 16.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  95 76.5 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  52 14.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  96 80.0 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  52 12.5 lz4 DATABASE_TEST_FILE /dev/null 

2119  96 82.5 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  74 10.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  98 86.5 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  52  8.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  98 89.0 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  52  7.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  98 90.5 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  52  6.5 lz4 DATABASE_TEST_FILE /dev/null 

2119  99 91.5 -bash (bash)

b. with kern.sched.preempt_thresh=11

  PID PRI %CPU COMMAND 

4920  21  0.0 lz4 DATABASE_TEST_FILE /dev/null 

2119 101 93.5 -bash (bash) 

  PID PRI %CPU COMMAND 

4920  78 20.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  94 70.5 -bash (bash) 

  PID PRI %CPU COMMAND 

4920  82 34.5 lz4 DATABASE_TEST_FILE /dev/null 

2119  88 54.0 -bash (bash) 

  PID PRI %CPU COMMAND 

4920  85 42.5 lz4 DATABASE_TEST_FILE /dev/null 

2119  86 45.0 -bash (bash) 

  PID PRI %CPU COMMAND 

4920  85 43.5 lz4 DATABASE_TEST_FILE /dev/null 

2119  86 44.5 -bash (bash) 

  PID PRI %CPU COMMAND 

4920  85 43.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  85 45.0 -bash (bash) 

  PID PRI %CPU COMMAND 

4920  85 43.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  85 45.5 -bash (bash)

 From this we can see that in case b. both processes balance out nicely 
and meet at equal CPU shares.
Whereas in case a., after about 10 Seconds (the first 3 records) they 
move to opposite ends of the scale and stay there.

 From this I might suppose that here is some kind of mis-calculation or 
mis-adjustment of the task priorities happening.

P.

George Mitchell

2018-Apr-08 18:41 UTC

head link

more data: SCHED_ULE+PREEMPTION is the problem

On 04/07/18 10:18, Peter wrote:> [...]
> B. Findings:
> ------------
> 1. Filesystem
> 
> I could never reproduce this when reading from plain UFS. Only when
> reading from ZFS (direct or via l2arc).
> [...]My consistent way of reproducing the problem was to run ports/misc/dnetc
while trying to buildworld/buildkernel.  But I was always building on a
UFS partition.  I don't know why your results would be different.
-- George

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL:
<http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20180408/5dfde18c/attachment.sig>

Stefan Esser

2018-Apr-09 10:07 UTC

head link

more data: SCHED_ULE+PREEMPTION is the problem

Am 07.04.18 um 16:18 schrieb Peter:> 3. kern.sched.preempt_thresh
> 
> I could make the problem disappear by changing kern.sched.preempt_thresh
?from
> the default 80 to either 11 (i5-3570T) or 7 (p3) or smaller. This seems to
> correspond to the disk interrupt threads, which run at intr:12 (i5-3570T)
or
> intr:8 (p3).
[CC added to include Jeff as the author of the ULE scheduler ...]

Since I had somewhat similar problems on my systems (with 4 Quad-Core with SMT
enabled, i.e. 8 threads of execution) with compute bound processes keeping I/O
intensive processes from running (load average of 12 with 8 "CPUs"),
and these
problems where affected by preempt_thresh, I checked how this variable is used
in the scheduler. The code is in /sys/kern/sched_ule.c.

It controls, whether a thread that has become runnable (e.g., after waiting
for disk I/O to complete) will preempt the thread currently running on
"this"
CPU (i.e. the one executing this test in the kernel).

IMHO, sched_preempt should default to a much higher number than 80 (e.g. 190),
but maybe I misunderstand some of the details ...


static inline int
sched_shouldpreempt(int pri, int cpri, int remote)
{

The parameters are:

	pri: the priority if the now runnable thread
	cpri: the priority of the thread that currently runs on "this" CPU
	remote: whether to consider preempting a thread on another CPU

The priority values are those displayed by top or ps -l as "PRI", but
with an
offset of 100 applied (i.e. pri=120 is displayed as PRI=20 in top).

If this thread has less priority than the currently executing one (cpri), the
currently running thread will not be preempted:

    /*


     * If the new priority is not better than the current priority there is


     * nothing to do.


     */
    if (pri >= cpri)
        return (0);

If the current thread is the idle thread, it will always be preempted by the
now runnable thread:

    /*


     * Always preempt idle.


     */
    if (cpri >= PRI_MIN_IDLE)
        return (1);

A value of preempt_thresh=0 (e.g. if "options PREEMPTION" is missing
in the
kernel config) lets the previously running thread continue (except if was the
idle thread, which has been dealt with above). The compute bound thread may
continue until its quantum has expired.

    /*


     * If preemption is disabled don't preempt others.


     */
    if (preempt_thresh == 0)
        return (0);

For any other value of preempt_thresh, the new priority of the thread that
just has become runnable will be compared to preempt_thresh and if this new
priority is higher (lower numeric value) or equal to preempt_thresh, the
thread for which (e.g.) disk I/O finished will preempt the current thread:

    /*


     * Preempt if we exceed the threshold.


     */
    if (pri <= preempt_thresh)
        return (1);

===> This is the only condition that depends on preempt_thresh > 0 <==
The flag "remote" controls whether this thread will be scheduled to
run, if
its priority is higher or equal to PRI_MAX_INTERACT (less than or equal to
151) and if the opposite is true for the currently running thread (cpri).
The value of remote will always be 0 on kernels built without "options
SMP".

    /*


     * If we're interactive or better and there is non-interactive


     * or worse running preempt only remote processors.


     */
    if (remote && pri <= PRI_MAX_INTERACT && cpri >
PRI_MAX_INTERACT)
        return (1);


The critical use of preempt_thresh is marked above. If it is 0, no preemption
will occur. On a single processor system, this should allow the CPU bound
thread to run for as long its quantum lasts.

A value of 120 (corresponding to PRI=20 in top) will allow the I/O bound
thread to preempt any other thread with lower priority (cpri > pri). But in
case of a high priority kernel thread being active during this test (with a
low numeric cpri value), the I/O bound process will not preempt that higher
priority thread (i.e. some high priority kernel thread).

Whether the I/O bound thread will run (instead of the compute bound) after
the higher priority thread has given up the CPU, will depend on the scheduler
decision which thread to select. And for "timeshare" threads, this
will often
not be the higher priority (I/O bound) thread, but the compute bound thread,
which then may execute until next being interrupted by the I/O bound thread
(which will not happen, if no new I/O has been requested).

This might explain, why setting preempt_thresh to a very low value (in the
range of real-time kernel threads) enforces preemption of the CPU bound
thread, while any higher (numeric) value of preempt_thresh prevents this
and makes tdq_choose() often select the low priority CPU bound over the
higher priority I/O bound thread.

BUT the first test in sched_shouldpreempt() should prevent any user process
from ever preempting a real-time thread "if (pri >= cpri) return
0;".

For preemption to occur,  pri must be numerically lower than cpri, and
pri must be numerically lower than or equal to preempt_thresh.
> a. with kern.sched.preempt_thresh=80
> 
> $ lz4 DATABASE_TEST_FILE /dev/null & while true;
> ? do ps -o pid,pri,"%cpu",command -p 2119,$!
> ? sleep 3
> done
> [1] 6073
[...]> ?PID PRI %CPU COMMAND
> 6073? 52? 6.5 lz4 DATABASE_TEST_FILE /dev/null
> 2119? 99 91.5 -bash (bash)
The I/O bound thread does not preempt the compute bound thread, when becoming
runnable (data arrived from disk).

With the value of preempt_thresh=80 (corresponding to PRI=-20) only real-time
threads may cause preemption, the I/O bound thread can not (PRI=52 / pri=152).

A value of preempt_thresh in the range of 190 (corresponding to PRI=90) should
allow the lz4 process to preempt the CPU bound process (with higher pri/PRI).
> b. with kern.sched.preempt_thresh=11
> 
> ?PID PRI %CPU COMMAND
> 4920? 21? 0.0 lz4 DATABASE_TEST_FILE /dev/null
> 2119 101 93.5 -bash (bash)
[...]> ?PID PRI %CPU COMMAND
> 4920? 85 43.0 lz4 DATABASE_TEST_FILE /dev/null
> 2119? 85 45.5 -bash (bash)
Such a low preempt_thresh does not allow any user process to preempt any other
one (except when running with temporarily increased priority in the kernel).

Only a kernel thread (soft interrupt?) at might cause preemption, and if the
interrupt is due to a read issued by the I/O bound thread completing, then the
I/O bound process is not the one being preempted. This will make the timeshare
scheduler select the process with higher priority (lower PRI) that did not
recently run (i.e. the I/O bound process, if both have the same PRI), when the
kernel thread goes to sleep.

But (if my analysis is correct) this indicates, that preempt_thresh set to an
extremely low value just helps by accident. The kernel thread interrupts the
CPU bound thread, and the I/O bound thread is selected as the next runnable
thread in the time-share run queue, either because of its lower PRI value or
because it did not run last before the preemption occurred (with equal PRI
for both).

But, in fact, the same scheduler selection should have occured in test (a),
too, if e.g. a soft interrupt preempts the compute bound thread. Not sure,
why this does not happen ... (And this may be an indication, that I do not
fully understand what's going on ;-) ...)
> From this we can see that in case b. both processes balance out nicely and
> meet at equal CPU shares.
> Whereas in case a., after about 10 Seconds (the first 3 records) they move
to
> opposite ends of the scale and stay there.
> 
> From this I might suppose that here is some kind of mis-calculation or
> mis-adjustment of the task priorities happening.
I'd be interested in your results with preempt_thresh set to a value of e.g.
190. The PRI=85 values in your test case (b) correspond to pri=185, and with
preempt_thresh slightly higher that that, the lz4 process should still get a
50% share of the CPU. (If its PRI grows over that of the CPU bound process,
it will not be able to preempt it, so its PRI should match the one of the CPU
bound process).

Regards, STefan

George Mitchell

2018-Apr-17 14:06 UTC

head link

more data: SCHED_ULE+PREEMPTION is the problem

On 04/07/18 10:18, Peter wrote:> Hi all,
> [...]
Thanks for all the investigation!> 3. kern.sched.preempt_thresh
> 
> I could make the problem disappear by changing kern.sched.preempt_thresh
> ?from the default 80 to either 11 (i5-3570T) or 7 (p3) or smaller. This
> seems to correspond to the disk interrupt threads, which run at intr:12
> (i5-3570T) or intr:8 (p3).
> [...]
More data.  With SCHED_4BSD at FreeBSD 10.4-RELEASE-p8 #0 r331984:
kern.sched.runq_fuzz: 1
kern.sched.ipiwakeup.useloop: 0
kern.sched.ipiwakeup.usemask: 1
kern.sched.ipiwakeup.delivered: 376139898
kern.sched.ipiwakeup.requested: 376137875
kern.sched.ipiwakeup.enabled: 1
kern.sched.slice: 12
kern.sched.quantum: 94488
kern.sched.name: 4BSD
kern.sched.preemption: 1
kern.sched.cpusetsize: 8
With dnetc running on a 6-core AMD CPU from a few years back,
"time make buildworld" yields:

6640.224u 828.874s 2:14:37.73 92.4%     28525+494k 31633+431554io 33192pf+0w

I shifted to a GENERIC kernel, FreeBSD 10.4-RELEASE-p8 #0 r332560:
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
  <cpu count="6" mask="3f">0, 1, 2, 3, 4,
5</cpu>
  <children>
   <group level="2" cache-level="2">
    <cpu count="6" mask="3f">0, 1, 2, 3, 4,
5</cpu>
   </group>
  </children>
 </group>
</groups>

kern.sched.steal_thresh: 2
kern.sched.steal_idle: 1
kern.sched.balance_interval: 127
kern.sched.balance: 1
kern.sched.affinity: 1
kern.sched.idlespinthresh: 157
kern.sched.idlespins: 10000
kern.sched.static_boost: 152
kern.sched.preempt_thresh: 80
kern.sched.interact: 30
kern.sched.slice: 12
kern.sched.quantum: 94488
kern.sched.name: ULE
kern.sched.preemption: 1
kern.sched.cpusetsize: 8

I stupidly typed "make buildworld" without the "time"
command, but the
build log started at Mon Apr 16 13:49:12 EDT 2018 and completed at
Tue Apr 17 00:22:23 EDT 2018.  You read that right: 2+ hours vs 10 1/2!
So I set "sysctl kern.sched.preempt_thresh=5" (a wild guess on my
part)
and started another "time make buildworld".  It's still going now,
but
subjectively it's still running like molasses.  I'll post more results
later after trying sysctl kern.sched.preempt_thresh=0.

By the way, over the years that this discussion has been going on, I've
*never* had a response to my question: "What is the workload for which
SCHED_ULE outperforms SCHED_4BSD?"                            -- George


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL:
<http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20180417/c8d05d29/attachment.sig>

freebsd stable - Apr 2018 - more data: SCHED_ULE+PREEMPTION is the problem

more data: SCHED_ULE+PREEMPTION is the problem (was: kern.sched.quantum: Creepy, sadistic scheduler

more data: SCHED_ULE+PREEMPTION is the problem

more data: SCHED_ULE+PREEMPTION is the problem

more data: SCHED_ULE+PREEMPTION is the problem