> On 4 Apr 2018, at 2:52, Peter <pmc at citylink.dinoex.sub.org> wrote:
>
> Occasionally I noticed that the system would not quickly process the
> tasks i need done, but instead prefer other, longrunning tasks. I
> figured it must be related to the scheduler, and decided it hates me.

If it hated you, it would behave much worse.

> A closer look shows the behaviour as follows (single CPU):

A single CPU? That's becoming rare! Is that a VM? Old hardware? Something really specific?

> Lets run an I/O-active task, e.g, postgres VACUUM that would

And you're running a multi-process database server on it no less. That is going to hurt, no matter how well the scheduler works.

> continuousely read from big files (while doing compute as well [1]):
>
>> pool        alloc   free   read  write   read  write
>> cache           -      -      -      -      -      -
>>   ada1s4    7.08G  10.9G  1.58K      0  12.9M      0
>
> Now start an endless loop:
> # while true; do :; done
>
> And the effect is:
>
>> pool        alloc   free   read  write   read  write
>> cache           -      -      -      -      -      -
>>   ada1s4    7.08G  10.9G      9      0  76.8K      0
>
> The VACUUM gets almost stuck! This figures with WCPU in "top":
>
>>   PID USERNAME  PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
>> 85583 root       99    0  7044K  1944K RUN      1:06  92.21% bash
>> 53005 pgsql      52    0   620M 91856K RUN      5:47   0.50% postgres
>
> Hacking on kern.sched.quantum makes it quite a bit better:
> # sysctl kern.sched.quantum=1
> kern.sched.quantum: 94488 -> 7874
>
>> pool        alloc   free   read  write   read  write
>> cache           -      -      -      -      -      -
>>   ada1s4    7.08G  10.9G    395      0  3.12M      0
>
>>   PID USERNAME  PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
>> 85583 root       94    0  7044K  1944K RUN      4:13  70.80% bash
>> 53005 pgsql      52    0   276M 91856K RUN      5:52  11.83% postgres
>
> Now, as usual, the "root-cause" questions arise: What exactly does
> this "quantum"? Is this solution a workaround, i.e. actually something
> else is wrong, and has it tradeoff in other situations? Or otherwise,
> why is such a default value chosen, which appears to be ill-deceived?
>
> The docs for the quantum parameter are a bit unsatisfying - they say
> its the max num of ticks a process gets - and what happens when
> they're exhausted? If by default the endless loop is actually allowed
> to continue running for 94k ticks (or 94ms, more likely) uninterrupted,
> then that explains the perceived behaviour - buts thats certainly not
> what a scheduler should do when other procs are ready to run.

I can answer this from the operating systems course I followed recently. This does not apply to FreeBSD specifically, it is general job scheduling theory. I still need to read up on SCHED_ULE to see how the details were implemented there. Or are you using the older SCHED_4BSD?

Anyway...

Jobs that are ready to run are collected on a ready queue. Since you have a single CPU, there can only be a single job active on the CPU. When that job is finished, the scheduler takes the next job in the ready queue and assigns it to the CPU, etc.

Now, that would cause a much worse situation in your example case. The endless loop would keep running once it gets the CPU and would never release it. No other process would ever get a turn again. You wouldn't even be able to get into such a system in that state using remote ssh.

That is why the scheduler has this "quantum", which limits the maximum time the CPU will be assigned to a specific job. Once the quantum has expired (with the job unfinished), the scheduler removes the job from the CPU, puts it back on the ready queue and assigns the next job from that queue to the CPU.
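To put rough numbers on that - this is my own toy sketch of the textbook round-robin model, not FreeBSD's scheduler code - assume the quantum really is about 94 ms and that the I/O-bound job only needs about a millisecond of CPU to issue each request (both figures are illustrative assumptions only):

/*
 * Toy model, NOT FreeBSD code: one CPU, a CPU hog and an I/O-bound job.
 * The I/O job runs briefly, issues a request and blocks; without
 * preemption it only gets the CPU back when the hog's quantum expires,
 * so its I/O rate is capped at roughly one request per quantum.
 */
#include <stdio.h>

static int io_requests_per_second(int quantum_ms)
{
        const int io_cpu_ms = 1;        /* CPU the I/O job needs per request */
        int t = 0, requests = 0;

        while (t < 1000) {              /* simulate one second of CPU time */
                t += io_cpu_ms;         /* I/O job runs, issues its request */
                requests++;
                t += quantum_ms;        /* hog runs until its quantum expires */
        }
        return requests;
}

int main(void)
{
        printf("quantum 94 ms -> ~%d I/O requests per second\n",
            io_requests_per_second(94));
        printf("quantum  8 ms -> ~%d I/O requests per second\n",
            io_requests_per_second(8));
        return 0;
}

With the large quantum the I/O-bound job is capped at roughly ten requests per second, which is about what your zpool iostat showed; with a small quantum the cap rises by an order of magnitude. (The real numbers won't match exactly, of course, since this ignores I/O latency, priorities and everything else.)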
That's why you seem to get better performance with a smaller value for the quantum; the endless loop gets forcibly interrupted more often.

This changing of the active job, however, involves a context switch for the CPU. Memory, registers, file handles, etc. that were required by the previous job need to be put aside and replaced by any such resources related to the new job to be run. That uses up time and does nothing to progress the jobs that are waiting for the CPU. Hence, you don't want the quantum to be too small either, or you'll end up spending significant time switching contexts. That gets worse when the job involves system calls, which are handled by the kernel, which also needs to be switched in (and Meltdown made that worse, because more rigorous clean-up is necessary to prevent peeks into sections of memory that previously belonged to the kernel).

The "correct" value for the quantum depends on your type of workload. PostgreSQL's auto-vacuum is a typical background process that will probably (I didn't verify) request to be run at a lower priority, giving other, more important jobs more chance to get picked from the ready queue (provided that the OS implements priority for the ready queue). That is probably why your endless loop gets much more CPU time than the VACUUM process.

It may be that FreeBSD's default value for the quantum is not suitable for your workload. Finding the one best suited to you is not particularly easy though - perhaps FreeBSD provides access to average job run times (below the quantum) from which a reasonable value could be calculated.

That said, SCHED_ULE (the default scheduler for quite a while now) was designed with multi-CPU configurations in mind, and there are claims that SCHED_4BSD works better for single-CPU configurations. You may give that a try, if you're not already on SCHED_4BSD.

A much better option in your case would be to put the database on a multi-core machine.

> [1]
> A pure-I/O job without compute load, like "dd", does not show
> this behaviour. Also, when other tasks are running, the unjust
> behaviour is not so stongly pronounced.

That is probably because dd has the decency to give the reins back to the scheduler at regular intervals.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
Stefan Esser
2018-Apr-04 13:30 UTC
Try setting kern.sched.preempt_thresh != 0 (was: Re: kern.sched.quantum: Creepy, sadistic scheduler)
On 04.04.18 12:39, Alban Hertroys wrote:
>> On 4 Apr 2018, at 2:52, Peter <pmc at citylink.dinoex.sub.org> wrote:
>>
>> [...]
>>
>> The docs for the quantum parameter are a bit unsatisfying - they say
>> its the max num of ticks a process gets - and what happens when
>> they're exhausted? If by default the endless loop is actually allowed
>> to continue running for 94k ticks (or 94ms, more likely) uninterrupted,
>> then that explains the perceived behaviour - buts thats certainly not
>> what a scheduler should do when other procs are ready to run.
>
> I can answer this from the operating systems course I followed recently. This does not apply to FreeBSD specifically, it is general job scheduling theory. I still need to read up on SCHED_ULE to see how the details were implemented there. Or are you using the older SCHED_4BSD?
> Anyway...
>
> Jobs that are ready to run are collected on a ready queue. Since you have a single CPU, there can only be a single job active on the CPU. When that job is finished, the scheduler takes the next job in the ready queue and assigns it to the CPU, etc.

I'm guessing that the problem is caused by kern.sched.preempt_thresh=0, which prevents preemption of low priority processes by interactive or I/O bound processes.

For a quick test try:

# sysctl kern.sched.preempt_thresh=1

to see whether it makes a difference.

The value 1 is unreasonably low, but it has the most visible effect in that any higher priority process can steal the CPU from any lower priority one (high priority corresponds to low PRI values as displayed by ps -l or top).
Reasonable values might be in the range of 80 to 224, depending on the system usage scenario (that's the range I found suggested on the mailing lists). Higher values result in less preemption.

Regards, STefan
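If a value from that range turns out to help, it can be made persistent across reboots via /etc/sysctl.conf, the usual place for run-time sysctls; the 224 below is just one value from the suggested range, not a recommendation:

# echo 'kern.sched.preempt_thresh=224' >> /etc/sysctl.conf
# sysctl kern.sched.preempt_thresh=224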
On 04/04/18 06:39, Alban Hertroys wrote:
> [...]
> That said, SCHED_ULE (the default scheduler for quite a while now) was designed with multi-CPU configurations in mind and there are claims that SCHED_4BSD works better for single-CPU configurations. You may give that a try, if you're not already on SCHED_4BSD.
> [...]

A small, disgruntled community of FreeBSD users who have never seen proof that SCHED_ULE is better than SCHED_4BSD in any environment continue to regularly recompile with SCHED_4BSD. I dread the day when that becomes impossible, but at least it isn't here yet.

--
George
Hi Alban!

Alban Hertroys wrote:
>> Occasionally I noticed that the system would not quickly process the
>> tasks i need done, but instead prefer other, longrunning tasks. I
>> figured it must be related to the scheduler, and decided it hates me.
>
> If it hated you, it would behave much worse.

That's encouraging :)
But I would say, running a job 100 times slower than expected is quite an amount of hate for my taste.

>> A closer look shows the behaviour as follows (single CPU):
>
> A single CPU? That's becoming rare! Is that a VM? Old hardware? Something really specific?

I don't plug in another CPU because there is no need to. Yes, it's old hardware:

CPU: Intel Pentium III (945.02-MHz 686-class CPU)
ACPI APIC Table: <ASUS CUV4XDLS>

If I had bought new hardware, this one would now rot in Africa, and I would have new hardware idling along that is spectre/meltdown affected nevertheless.

>> Lets run an I/O-active task, e.g, postgres VACUUM that would
>
> And you're running a multi-process database server on it no less. That
> is going to hurt,

I'm running a lot more than only that on it. But it's all private use, idling most of the time.

> no matter how well the scheduler works.

Maybe. But this post is not about my personal expectations of overall performance - it is about a specific behaviour that is not how a scheduler is expected to behave, no matter whether we're on a PDP-11 or on a KabyLake.

>> Now, as usual, the "root-cause" questions arise: What exactly does
>> this "quantum"? Is this solution a workaround, i.e. actually something
>> else is wrong, and has it tradeoff in other situations? Or otherwise,
>> why is such a default value chosen, which appears to be ill-deceived?
>>
>> [...]
>
> I can answer this from the operating systems course I followed recently. This does not apply to FreeBSD specifically, it is general job scheduling theory. I still need to read up on SCHED_ULE to see how the details were implemented there. Or are you using the older SCHED_4BSD?

I'm using the default scheduler, which is ULE. I would not go non-default without reason. (But it seems a reason is just appearing now.)

> Now, that would cause a much worse situation in your example case. The endless loop would keep running once it gets the CPU and would never release it. No other process would ever get a turn again. You wouldn't even be able to get into such a system in that state using remote ssh.
>
> That is why the scheduler has this "quantum", which limits the maximum time the CPU will be assigned to a specific job. Once the quantum has expired (with the job unfinished), the scheduler removes the job from the CPU, puts it back on the ready queue and assigns the next job from that queue to the CPU.
> That's why you seem to get better performance with a smaller value for the quantum; the endless loop gets forcibly interrupted more often.

Good description. Only my (old-fashioned) understanding was that this is the purpose of the HZ value: to give control back to the kernel, so that a new decision can be made.
So, I would not have been surprised to see 200 I/Os for postgres (kern.hz=200), but what I see is 9 I/Os (which indeed figures to a "quantum" of 94ms).

But then, we were able to do all this nicely on single-CPU machines for almost four decades. It does not make sense to me if now we state that we cannot do it anymore because single-CPU is uncommon today. (Yes indeed, we also cannot fly to the moon anymore, because today nobody seems to recall how that stuff was built. *headbangwall*)

> This changing of the active job, however, involves a context switch for the CPU. Memory, registers, file handles, etc. that were required by the previous job need to be put aside and replaced by any such resources related to the new job to be run. That uses up time and does nothing to progress the jobs that are waiting for the CPU. Hence, you don't want the quantum to be too small either, or you'll end up spending significant time switching contexts.

Yepp. My understanding was that I can influence this behaviour via the HZ value, so as to trade off responsiveness against performance. Obviously that was wrong. From your writing, it seems the "quantum" is indeed the correct place to tune this. (But I will still have to ponder a while about the knob mentioned by Stefan, concerning preemption, which seems to magically resolve the issue.)

> That said, SCHED_ULE (the default scheduler for quite a while now) was designed with multi-CPU configurations in mind, and there are claims that SCHED_4BSD works better for single-CPU configurations. You may give that a try, if you're not already on SCHED_4BSD.

I'll try this at the next code update.

> A much better option in your case would be to put the database on a multi-core machine.

I could plug in the second CPU, but it would mostly just heat the room. So a modern low-energy CPU would do better - but then, try to get a modern CPU (+ board!) that supports ECC RAM, and you'll end up in the high-end department. This old one does, so it's just perfect for 24/365. Nevertheless, this is a software issue, and fixing it via new hardware should be only the last resort.

>> [1]
>> A pure-I/O job without compute load, like "dd", does not show
>> this behaviour. Also, when other tasks are running, the unjust
>> behaviour is not so stongly pronounced.
>
> That is probably because dd has the decency to give the reins back to the scheduler at regular intervals.

No, rather the other way round: running dd against the piglet (aka the endless loop), both run at full speed. Running postgres VACUUM against the piglet, postgres starves.

My (rather vague) explanation: when an I/O for dd comes back, dd immediately requests the next one. When an I/O for postgres comes back, postgres needs to compute its transaction ID stuff, competes against the piglet for CPU, and loses.

P.
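That explanation can be tested with a small program (a sketch of my own, not something from this thread): it reads a big file in 64 KiB blocks and optionally burns a little CPU after each block, mimicking a dd-like pure reader versus a VACUUM-like reader that needs CPU between I/Os. Run each variant against a file larger than RAM (so the reads really hit the disk) while the endless loop is active, and compare the throughput:

/*
 * Hypothetical test program, compile with: cc -O2 -o ioload ioload.c
 * Usage:
 *   ./ioload /path/to/bigfile 0   # pure I/O, dd-like
 *   ./ioload /path/to/bigfile 1   # I/O plus compute, VACUUM-like
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s file do_compute(0|1)\n", argv[0]);
                return 1;
        }

        FILE *f = fopen(argv[1], "rb");
        if (f == NULL) {
                perror("fopen");
                return 1;
        }

        int do_compute = atoi(argv[2]);
        static char buf[65536];
        volatile unsigned long sink = 0;
        unsigned long blocks = 0;
        time_t start = time(NULL);

        while (fread(buf, 1, sizeof(buf), f) == sizeof(buf)) {
                if (do_compute) {
                        /* Burn a bit of CPU per block, like tuple processing. */
                        for (unsigned long i = 0; i < 2000000; i++)
                                sink += i;
                }
                blocks++;
        }

        time_t elapsed = time(NULL) - start;
        if (elapsed < 1)
                elapsed = 1;
        printf("%lu blocks (%lu MiB) read in %ld s\n",
            blocks, blocks / 16, (long)elapsed);
        fclose(f);
        return 0;
}

If the hypothesis is right, the pure-I/O variant should keep most of its throughput next to the piglet, while the compute variant should collapse in much the same way the VACUUM did.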