Occasionally I noticed that the system would not quickly process the
tasks I need done, but instead prefer other, long-running tasks. I
figured it must be related to the scheduler, and decided it hates me.

A closer look shows the behaviour as follows (single CPU):

Let's run an I/O-active task, e.g. a postgres VACUUM that continuously
reads from big files (while doing compute as well [1]):

>pool        alloc   free   read  write   read  write
>cache           -      -      -      -      -      -
>  ada1s4    7.08G  10.9G  1.58K      0  12.9M      0

Now start an endless loop:

# while true; do :; done

And the effect is:

>pool        alloc   free   read  write   read  write
>cache           -      -      -      -      -      -
>  ada1s4    7.08G  10.9G      9      0  76.8K      0

The VACUUM gets almost stuck! This matches the WCPU column in "top":

>  PID USERNAME  PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
>85583 root       99    0  7044K  1944K RUN      1:06  92.21% bash
>53005 pgsql      52    0   620M 91856K RUN      5:47   0.50% postgres

Hacking on kern.sched.quantum makes it quite a bit better:

# sysctl kern.sched.quantum=1
kern.sched.quantum: 94488 -> 7874

>pool        alloc   free   read  write   read  write
>cache           -      -      -      -      -      -
>  ada1s4    7.08G  10.9G    395      0  3.12M      0

>  PID USERNAME  PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
>85583 root       94    0  7044K  1944K RUN      4:13  70.80% bash
>53005 pgsql      52    0   276M 91856K RUN      5:52  11.83% postgres

Now, as usual, the "root-cause" questions arise: What exactly does this
"quantum" do? Is this solution a workaround, i.e. is something else
actually wrong, and does it have trade-offs in other situations? Or
otherwise, why was such a default value chosen, which appears to be
ill-conceived?

The docs for the quantum parameter are a bit unsatisfying - they say it
is the maximum number of ticks a process gets - but what happens when
they are exhausted? If by default the endless loop is actually allowed
to continue running for 94k ticks (or 94 ms, more likely) uninterrupted,
then that explains the perceived behaviour - but that is certainly not
what a scheduler should do when other processes are ready to run.

11.1-RELEASE-p7, kern.hz=200. Switching tickless mode on or off does not
influence the matter. Starting the endless loop with "nice" does not
influence the matter.

[1]
A pure-I/O job without compute load, like "dd", does not show this
behaviour. Also, when other tasks are running, the unjust behaviour is
not so strongly pronounced.
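A side note on the numbers above, with a small C sketch to go with them:
as far as I can tell, ULE's kern.sched.quantum sysctl is expressed in
microseconds, so the default of 94488 is roughly 94 ms - about 19 ticks
at kern.hz=200 - and if the worker really gets to issue only one read per
quantum, its ceiling is about 1/0.094 s, i.e. roughly 10 reads per
second, which fits the ~9 reads/s and 76.8K/s observed above. The
microsecond unit is an assumption here, not verified against the source;
the sketch merely reads the sysctls and prints the derived numbers.

/*
 * quantum_info.c -- print kern.sched.quantum and derive a few numbers.
 * Assumption (not verified): the quantum sysctl is in microseconds.
 * Build: cc -o quantum_info quantum_info.c
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
    int quantum, hz;
    size_t len;

    len = sizeof(quantum);
    if (sysctlbyname("kern.sched.quantum", &quantum, &len, NULL, 0) != 0) {
        perror("kern.sched.quantum");
        return (1);
    }
    len = sizeof(hz);
    if (sysctlbyname("kern.hz", &hz, &len, NULL, 0) != 0) {
        perror("kern.hz");
        return (1);
    }
    printf("quantum: %d us = %.1f ms = %.1f ticks at hz=%d\n",
        quantum, quantum / 1000.0, quantum / 1e6 * hz, hz);
    /* If a blocked reader only runs once per quantum, this is its ceiling: */
    printf("max reads for a once-per-quantum worker: %.1f per second\n",
        1e6 / quantum);
    return (0);
}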
> On 4 Apr 2018, at 2:52, Peter <pmc at citylink.dinoex.sub.org> wrote:
>
> Occasionally I noticed that the system would not quickly process the
> tasks I need done, but instead prefer other, long-running tasks. I
> figured it must be related to the scheduler, and decided it hates me.

If it hated you, it would behave much worse.

> A closer look shows the behaviour as follows (single CPU):

A single CPU? That's becoming rare! Is that a VM? Old hardware? Something
really specific?

> Let's run an I/O-active task, e.g. a postgres VACUUM that continuously
> reads from big files (while doing compute as well [1]):

And you're running a multi-process database server on it, no less. That
is going to hurt, no matter how well the scheduler works.

> >pool        alloc   free   read  write   read  write
> >cache           -      -      -      -      -      -
> >  ada1s4    7.08G  10.9G  1.58K      0  12.9M      0
>
> Now start an endless loop:
>
> # while true; do :; done
>
> And the effect is:
>
> >pool        alloc   free   read  write   read  write
> >cache           -      -      -      -      -      -
> >  ada1s4    7.08G  10.9G      9      0  76.8K      0
>
> The VACUUM gets almost stuck! This matches the WCPU column in "top":
>
> >  PID USERNAME  PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
> >85583 root       99    0  7044K  1944K RUN      1:06  92.21% bash
> >53005 pgsql      52    0   620M 91856K RUN      5:47   0.50% postgres
>
> Hacking on kern.sched.quantum makes it quite a bit better:
>
> # sysctl kern.sched.quantum=1
> kern.sched.quantum: 94488 -> 7874
>
> >pool        alloc   free   read  write   read  write
> >cache           -      -      -      -      -      -
> >  ada1s4    7.08G  10.9G    395      0  3.12M      0
>
> >  PID USERNAME  PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
> >85583 root       94    0  7044K  1944K RUN      4:13  70.80% bash
> >53005 pgsql      52    0   276M 91856K RUN      5:52  11.83% postgres
>
> Now, as usual, the "root-cause" questions arise: What exactly does this
> "quantum" do? Is this solution a workaround, i.e. is something else
> actually wrong, and does it have trade-offs in other situations? Or
> otherwise, why was such a default value chosen, which appears to be
> ill-conceived?
>
> The docs for the quantum parameter are a bit unsatisfying - they say it
> is the maximum number of ticks a process gets - but what happens when
> they are exhausted? If by default the endless loop is actually allowed
> to continue running for 94k ticks (or 94 ms, more likely) uninterrupted,
> then that explains the perceived behaviour - but that is certainly not
> what a scheduler should do when other processes are ready to run.

I can answer this from the operating systems course I followed recently.
This does not apply to FreeBSD specifically; it is general job-scheduling
theory. I still need to read up on SCHED_ULE to see how the details were
implemented there. Or are you using the older SCHED_4BSD? Anyway...

Jobs that are ready to run are collected on a ready queue. Since you have
a single CPU, there can only be a single job active on the CPU. When that
job is finished, the scheduler takes the next job in the ready queue and
assigns it to the CPU, etc.

Now, that would cause a much worse situation in your example case. The
endless loop would keep running once it gets the CPU and would never
release it. No other process would ever get a turn again. You wouldn't
even be able to get into a system in that state using remote ssh.

That is why the scheduler has this "quantum", which limits the maximum
time the CPU will be assigned to a specific job. Once the quantum has
expired (with the job unfinished), the scheduler removes the job from the
CPU, puts it back on the ready queue and assigns the next job from that
queue to the CPU.
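To make the ready-queue/quantum model above concrete, here is a toy
round-robin simulation in C. It is not FreeBSD's scheduler, and every
number in it is invented: a "piglet" that never blocks always consumes
its full quantum, while a "worker" computes for a couple of milliseconds,
issues a read, and sleeps until the I/O completes. The toy deliberately
never preempts the running job before its quantum is up, which is exactly
the behaviour being questioned in this thread.

/*
 * toy_rr.c -- toy round-robin with a quantum (NOT FreeBSD's scheduler).
 * Two jobs on one CPU: a "piglet" that never blocks, and a "worker"
 * that computes for a short burst and then blocks for I/O.
 * All numbers are invented for illustration.
 */
#include <stdio.h>

#define QUANTUM   94   /* ms, roughly the default quantum from this thread */
#define BURST      2   /* ms of compute the worker does before each read   */
#define IO_WAIT    5   /* ms the worker spends waiting for the disk        */

int
main(void)
{
    long now = 0, piglet_cpu = 0, worker_cpu = 0, worker_reads = 0;
    long worker_ready_at = 0;

    while (now < 10000) {            /* simulate 10 seconds */
        if (worker_ready_at <= now) {
            /* Worker is runnable: it runs its short burst,
             * then blocks for I/O and leaves the CPU early. */
            worker_cpu += BURST;
            now += BURST;
            worker_reads++;
            worker_ready_at = now + IO_WAIT;
        }
        /* Piglet runs until its quantum expires, then is requeued. */
        piglet_cpu += QUANTUM;
        now += QUANTUM;
    }
    printf("piglet: %ld ms CPU, worker: %ld ms CPU, %ld reads in %ld ms\n",
        piglet_cpu, worker_cpu, worker_reads, now);
    return (0);
}

With QUANTUM at 94 ms the worker ends up with roughly 2% of the CPU and
about 10 reads per second; shrinking QUANTUM in the toy multiplies the
worker's read rate, which is the effect the kern.sched.quantum=1
experiment above showed.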
That's why you seem to get better performance with a smaller value for
the quantum: the endless loop gets forcibly interrupted more often.

This changing of the active job, however, involves a context switch for
the CPU. Memory, registers, file handles, etc. that were required by the
previous job need to be put aside and replaced by any such resources
related to the new job to be run. That uses up time and does nothing to
progress the jobs that are waiting for the CPU. Hence, you don't want the
quantum to be too small either, or you'll end up spending significant
time switching contexts. That gets worse when the job involves system
calls, which are handled by the kernel, which also has to be switched in
(and Meltdown made that worse, because more rigorous clean-up is
necessary to prevent peeks into sections of memory that were previously
owned by the kernel).

The "correct" value for the quantum depends on your type of workload.

PostgreSQL's auto-vacuum is a typical background process that will
probably (I didn't verify) request to be run at a lower priority, giving
other, more important jobs a better chance of getting picked from the
ready queue (provided that the OS implements priority for the ready
queue). That is probably why your endless loop gets much more CPU time
than the VACUUM process.

It may be that FreeBSD's default value for the quantum is not suitable
for your workload. Finding the one best suited to you is not particularly
easy, though - perhaps FreeBSD allows access to average job times (below
the quantum) from which a reasonable value could be calculated.

That said, SCHED_ULE (the default scheduler for quite a while now) was
designed with multi-CPU configurations in mind, and there are claims that
SCHED_4BSD works better for single-CPU configurations. You may give that
a try, if you're not already on SCHED_4BSD. A much better option in your
case would be to put the database on a multi-core machine.

> [1]
> A pure-I/O job without compute load, like "dd", does not show this
> behaviour. Also, when other tasks are running, the unjust behaviour is
> not so strongly pronounced.

That is probably because dd has the decency to give the reins back to the
scheduler at regular intervals.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
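The context-switch trade-off described above can be reduced to a one-line
estimate: if every quantum of useful work costs one switch, the wasted
fraction is roughly switch_cost / (quantum + switch_cost). In the sketch
below the per-switch costs are invented placeholders (real costs vary
with the CPU and, after Meltdown, with the mitigations in use); the two
large quantum values are the ones seen in this thread.

/*
 * switch_overhead.c -- rough scheduling-overhead estimate:
 *   overhead = switch_cost / (quantum + switch_cost)
 * The 5 us and 50 us switch costs are made-up example values.
 */
#include <stdio.h>

int
main(void)
{
    const double quanta_us[] = { 94488.0, 7874.0, 1000.0, 100.0 };
    const double switch_us[] = { 5.0, 50.0 };

    for (unsigned i = 0; i < sizeof(quanta_us) / sizeof(quanta_us[0]); i++)
        for (unsigned j = 0; j < sizeof(switch_us) / sizeof(switch_us[0]); j++)
            printf("quantum %8.0f us, switch %4.0f us -> %.3f%% of CPU lost\n",
                quanta_us[i], switch_us[j],
                100.0 * switch_us[j] / (quanta_us[i] + switch_us[j]));
    return (0);
}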
On 04/04/2018 03:52, Peter wrote:
> Let's run an I/O-active task, e.g. a postgres VACUUM that continuously
> reads from big files (while doing compute as well [1]):

Not everyone has a postgres server and a suitable database. Could you
please devise a test scenario that demonstrates the problem and that
anyone could run?

--
Andriy Gapon
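In the spirit of that request, a self-contained "worker" could be a small
C program that reads a large file sequentially and burns a little CPU on
every chunk - roughly what VACUUM, or the lz4 run used later in this
thread, does. The path is a placeholder; point it at a big file on a ZFS
dataset (large enough that it is not simply served from the ARC), start
the endless shell loop in another terminal, and compare "zpool iostat 1".
This is only a sketch of a possible reproducer, not something verified to
trigger the problem.

/*
 * worker.c -- possible reproducer: sequential reads with a bit of
 * compute per chunk, to mimic a VACUUM/lz4-style workload.
 * Usage: ./worker /path/to/big_file_on_zfs   (path is a placeholder)
 * Run "while true; do :; done" in another shell and compare
 * "zpool iostat 1" with and without the busy loop.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    static char buf[128 * 1024];
    volatile unsigned long sum = 0;
    ssize_t n;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <big file>\n", argv[0]);
        return (1);
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0) {
        perror(argv[1]);
        return (1);
    }
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* Touch every byte a few times so each read is followed by a
         * short stretch of compute, like the "worker" in this thread. */
        for (int pass = 0; pass < 4; pass++)
            for (ssize_t i = 0; i < n; i++)
                sum += (unsigned char)buf[i];
    }
    close(fd);
    printf("checksum %lu\n", (unsigned long)sum);
    return (0);
}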
Peter
2018-Apr-07 14:18 UTC
more data: SCHED_ULE+PREEMPTION is the problem (was: kern.sched.quantum: Creepy, sadistic scheduler)
Hi all,

in the meantime I did some tests and found the following:

A. The Problem:
---------------

On a single CPU, there are -exactly- two processes runnable:

One is doing mostly compute without I/O - this can be a compressing job
or similar; in the tests I simply used an endless loop. Let's call this
the "piglet".

The other is doing frequent file reads, but also some compute in between
- this can be a backup job traversing the FS, or a postgres VACUUM, or
some fast compressor like lz4. Let's call this the "worker".

It then happens that the piglet gets 99% CPU, while the worker gets only
0.5% CPU and makes nearly no progress at all.

Investigation shows that the worker makes precisely one I/O per timeslice
(timeslice as defined in kern.sched.quantum) - or two I/Os on a mirrored
ZFS.

B. Findings:
------------

1. Filesystem

I could never reproduce this when reading from plain UFS, only when
reading from ZFS (directly or via L2ARC).

2. Machine

The problem originally appeared on a Pentium 3 at 1 GHz. I was able to
reproduce it on an i5-3570T, given the following measures:
 * configure the BIOS to use only one CPU
 * reduce the speed: "dev.cpu.0.freq=200"

I did see the problem also when running at full speed (which means it
happens there as well), but could not reproduce it reliably.

3. kern.sched.preempt_thresh

I could make the problem disappear by changing kern.sched.preempt_thresh
from the default 80 to either 11 (i5-3570T) or 7 (P3) or smaller. This
seems to correspond to the disk interrupt threads, which run at intr:12
(i5-3570T) or intr:8 (P3).

4. Dynamic behaviour

Here the piglet is already running as PID 2119. Then we can watch the
dynamic behaviour as follows (on the i5-3570T at 200 MHz):

a. with kern.sched.preempt_thresh=80

$ lz4 DATABASE_TEST_FILE /dev/null & while true; do
    ps -o pid,pri,"%cpu",command -p 2119,$!
    sleep 3
  done
[1] 6073
  PID PRI %CPU COMMAND
 6073  20  0.0 lz4 DATABASE_TEST_FILE /dev/null
 2119 100 91.0 -bash (bash)
  PID PRI %CPU COMMAND
 6073  76 15.0 lz4 DATABASE_TEST_FILE /dev/null
 2119  95 74.5 -bash (bash)
  PID PRI %CPU COMMAND
 6073  52 19.0 lz4 DATABASE_TEST_FILE /dev/null
 2119  94 71.5 -bash (bash)
  PID PRI %CPU COMMAND
 6073  52 16.0 lz4 DATABASE_TEST_FILE /dev/null
 2119  95 76.5 -bash (bash)
  PID PRI %CPU COMMAND
 6073  52 14.0 lz4 DATABASE_TEST_FILE /dev/null
 2119  96 80.0 -bash (bash)
  PID PRI %CPU COMMAND
 6073  52 12.5 lz4 DATABASE_TEST_FILE /dev/null
 2119  96 82.5 -bash (bash)
  PID PRI %CPU COMMAND
 6073  74 10.0 lz4 DATABASE_TEST_FILE /dev/null
 2119  98 86.5 -bash (bash)
  PID PRI %CPU COMMAND
 6073  52  8.0 lz4 DATABASE_TEST_FILE /dev/null
 2119  98 89.0 -bash (bash)
  PID PRI %CPU COMMAND
 6073  52  7.0 lz4 DATABASE_TEST_FILE /dev/null
 2119  98 90.5 -bash (bash)
  PID PRI %CPU COMMAND
 6073  52  6.5 lz4 DATABASE_TEST_FILE /dev/null
 2119  99 91.5 -bash (bash)

b. with kern.sched.preempt_thresh=11

  PID PRI %CPU COMMAND
 4920  21  0.0 lz4 DATABASE_TEST_FILE /dev/null
 2119 101 93.5 -bash (bash)
  PID PRI %CPU COMMAND
 4920  78 20.0 lz4 DATABASE_TEST_FILE /dev/null
 2119  94 70.5 -bash (bash)
  PID PRI %CPU COMMAND
 4920  82 34.5 lz4 DATABASE_TEST_FILE /dev/null
 2119  88 54.0 -bash (bash)
  PID PRI %CPU COMMAND
 4920  85 42.5 lz4 DATABASE_TEST_FILE /dev/null
 2119  86 45.0 -bash (bash)
  PID PRI %CPU COMMAND
 4920  85 43.5 lz4 DATABASE_TEST_FILE /dev/null
 2119  86 44.5 -bash (bash)
  PID PRI %CPU COMMAND
 4920  85 43.0 lz4 DATABASE_TEST_FILE /dev/null
 2119  85 45.0 -bash (bash)
  PID PRI %CPU COMMAND
 4920  85 43.0 lz4 DATABASE_TEST_FILE /dev/null
 2119  85 45.5 -bash (bash)

From this we can see that in case b. both processes balance out nicely
and meet at equal CPU shares.
Whereas in case a., after about 10 seconds (the first 3 records), they
move to opposite ends of the scale and stay there.

From this I might suppose that some kind of miscalculation or
misadjustment of the task priorities is happening here.

P.
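For what it's worth, my reading of kern.sched.preempt_thresh is that it
decides whether a thread that has just become runnable (for instance,
woken by a completed disk I/O) may push the currently running thread off
the CPU before its quantum expires, based on a priority comparison
against the threshold. The sketch below is a plain-C paraphrase of that
idea, not the actual sched_ule.c code, and the priorities plugged in are
simply the ones visible in the ps/top output above - treat the exact
rules as an assumption to be checked against the source.

/*
 * preempt_sketch.c -- paraphrase (NOT the kernel source) of the kind of
 * decision kern.sched.preempt_thresh is involved in: may a newly woken
 * thread with priority "pri" push the currently running thread with
 * priority "cpri" off the CPU right now, instead of waiting for the
 * quantum to run out?  Lower numbers mean better priority.
 */
#include <stdio.h>

static int preempt_thresh = 80;     /* the default seen in this thread */

static int
should_preempt(int pri, int cpri)
{
    if (pri >= cpri)                /* not better than what is running */
        return (0);
    if (preempt_thresh == 0)        /* preemption disabled */
        return (0);
    return (pri <= preempt_thresh); /* only "important enough" threads */
}

int
main(void)
{
    /* Numbers taken from the ps/top output in this thread. */
    printf("intr:12 vs piglet(100): %d\n", should_preempt(12, 100));
    printf("worker(52) vs piglet(100): %d\n", should_preempt(52, 100));

    preempt_thresh = 11;    /* the value that made the problem go away */
    printf("with thresh=11, intr:12 vs piglet(100): %d\n",
        should_preempt(12, 100));
    printf("with thresh=11, worker(52) vs piglet(100): %d\n",
        should_preempt(52, 100));
    return (0);
}

Note that by this naive reading, lowering the threshold from 80 to 11
should allow fewer preemptions, not more, so it does not by itself
explain why 11 helps here; that apparent contradiction points back at the
priority miscalculation or misadjustment suspected above.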