Hi Andrew,
What we have is not a patch to any of LLVM's schedulers. We have implemented
our own scheduler and integrated it into LLVM 2.9 as yet another scheduler. Our
scheduler uses a combinatorial optimization approach to balance ILP and register
pressure. In one experiment, we added more precise latency information for most
common x86 instructions to our scheduler and noticed a 10% performance
improvement on one FP2006 benchmark, namely gromacs. More precisely, we
compared:
(1) LLVM2.9+ourScheduler+preciseLatency
against
(2) LLVM2.9+ourScheduler+LLVM's-rough-1-10-latency-model
And (1) was faster than (2) by 10%. We concluded that adding precise latency
information may significantly improve the performance of programs like gromacs,
which have a high degree of ILP. However, we did not do any performance
analysis to verify that the improvement is indeed due to better ILP
scheduling (fewer pipeline stalls) and not to some other factor, such as
memory behavior, which may have improved coincidentally as a result of the
reordering.
We are not currently running any performance analysis tools, but if you are
interested, we can identify the code section whose reordering has led to this
performance improvement and send you the generated code in both cases. We can
probably narrow it down to one basic block fairly easily, because our scheduler
only operates on the hottest 2 functions in that benchmark and those have a
total of 18 basic blocks. BTW, do you guys run SPEC CPU2006? If not, which
benchmarks do you use for evaluating LLVM's performance?
BTW, our scheduler is about 3% faster than LLVM's ILP scheduler on that
benchmark when it uses LLVM's 1-10 latency model. So, with the precise
latency info, our scheduler is about 13% faster than LLVM's ILP scheduler
on that benchmark.
What may be more interesting for you though is finding out whether adding more
precise latency information improves the performance of LLVM's ILP
scheduler. We are interested in answering that question too, and are willing to
add an x86 machine model to LLVM if that task does not turn out to be too
complicated. So, how easy would it be to add an x86 itinerary? Can you point me
to the file or files that have to be added or changed? Where can I find an
example of an existing itinerary for some other target?
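For concreteness, the kind of thing we have in mind is a TableGen itinerary
along the lines of the ARM ones we found under lib/Target/ARM (e.g.
ARMScheduleA8.td). The sketch below is only illustrative: the itinerary
classes, functional units, and latency numbers for x86 are invented, and the
exact ProcessorItineraries signature may differ between LLVM versions:

```tablegen
// Hypothetical x86 itinerary sketch; class/unit names and latencies are
// invented for illustration, not taken from any real x86 model.
def IIC_ALU  : InstrItinClass;
def IIC_FDIV : InstrItinClass;

def X86Port0 : FuncUnit;
def X86Port1 : FuncUnit;

def HypotheticalX86Itineraries : ProcessorItineraries<
  [X86Port0, X86Port1], [], [
  // Single-cycle ALU op that can issue on either port.
  InstrItinData<IIC_ALU,  [InstrStage<1, [X86Port0, X86Port1]>], [1]>,
  // Long-latency FP divide that occupies port 0.
  InstrItinData<IIC_FDIV, [InstrStage<1, [X86Port0]>], [24]>
]>;
```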
BTW, in the experiment described above we only added latency information;
we did not add functional-unit information. We are interested in adding
both, but we would like to do it in two steps (first latency only, then
functional-unit info) and measure the impact of each step.
Regards,
-Ghassan
________________________________
From: Andrew Trick <atrick at apple.com>
To: Ghassan Shobaki <ghassan_shobaki at yahoo.com>
Cc: "llvmdev at cs.uiuc.edu" <llvmdev at cs.uiuc.edu>
Sent: Wednesday, September 21, 2011 6:54 AM
Subject: Re: [LLVMdev] Pre-Allocation Schedulers in LLVM
On Sep 17, 2011, at 10:07 AM, Ghassan Shobaki wrote:
> Hi,
>
> I am currently writing a paper documenting a research project that we
> have done on pre-allocation instruction scheduling to balance ILP and
> register pressure. In the paper we compare the pre-allocation scheduler
> that we have developed to LLVM's default schedulers for two targets:
> x86-64 and x86-32. We would like to include in our paper some brief
> descriptions of the two LLVM schedulers that we are comparing against
> and some information about the machine model that they are scheduling
> for. So, it would be great if you could confirm or correct the
> following information and answer my questions below:
>
> The default scheduler for the x86-32 target is the bottom-up
> register-pressure reduction (BURR) scheduler, while for the x86-64
> target it is the ILP scheduler. According to the brief documentation in
> the source file ScheduleDAGRRList, the BURR is a register pressure
> reduction scheduler, while the ILP scheduler is a register-pressure-aware
> scheduler that tries to balance ILP and register pressure.
Yes. For those wondering how to find out, grep for
'setSchedulingPreference'.
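Concretely, that search looks like this (the snippet sets up a stand-in
file so it runs anywhere; in a real LLVM 2.9 checkout you would simply
grep lib/Target/ from the source root, and the file and enum value shown
here are just placeholders):

```shell
# Stand-in tree so the snippet is self-contained; in a real LLVM 2.9
# checkout, grep lib/Target/ directly instead.
mkdir -p /tmp/llvm-demo/lib/Target/X86
cat > /tmp/llvm-demo/lib/Target/X86/X86ISelLowering.cpp <<'EOF'
  setSchedulingPreference(Sched::ILP);
EOF
# Prints the file and line where each target registers its preference.
grep -rn "setSchedulingPreference" /tmp/llvm-demo/lib/Target
```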
> My questions are:
>
> - Are there any references (such as published research) that describe
> each/any of these scheduling algorithms?
The LLVM ILP scheduler is a mix of heuristics, some of which may be standard
practice while others are LLVM inventions. The combination of heuristics can
only be described as ad hoc.
> - By examining the source code, it appears that neither scheduler has a
> machine model describing the functional units and the mapping of
> instructions to functional units on the target x86 machine. Is that
> right?
The x86 LLVM target does not have a machine pipeline model (instruction
itinerary).
> - Based on the test cases that I have analyzed, it looks like the BURR
> scheduler sets all latencies to 1, which essentially eliminates any
> scheduling for ILP and makes register pressure reduction the only
> objective of this scheduler. Can you please confirm or correct this?
You're obviously wondering why this is called an "ILP" scheduler,
which is understandable. I would even argue that the term "scheduler"
is not entirely accurate.
At any rate, this scheduler does not directly increase IPC by avoiding pipeline
stalls as you might expect. What we mean by ILP scheduling is increasing the
number of "in-flight" instructions. This is often the opposite of
register pressure. The key to this is carefully tracking register pressure
(locally) and dependence height. The heuristics decide whether an instruction is
likely to increase register pressure. If not, then we attempt to overlap chains
of dependent instructions.
Note that LLVM's "hybrid" scheduler works much the same as the ILP
scheduler but uses a mix of heuristics that happens to work better for
targets with an instruction itinerary.
> - Again based on analyzing test cases, it appears that the ILP
> scheduler sets the latencies of DIV and SQRT (both INT and FP) to 10,
> while the latencies of all other instructions are set to 1. Can you
> please confirm or correct this observation?
The scheduler is aware that certain operations are "very high latency",
but frankly that knowledge isn't crucial. 10 is used because we want a
number much greater than 1 that is still within the range of reasonable
expected latencies. Without a pipeline model, it just doesn't make much
difference.
> Apparently, the developers of the ILP scheduler assumed that this rough
> latency model would be sufficient for ILP scheduling on the x86 target,
> because the x86 hardware has a good dynamic scheduler. Our testing,
> however, shows that this is the case for most but not all programs. For
> one particular benchmark with a high degree of ILP, using more precise
> latency info significantly improved performance. Will the LLVM
> developers be interested in adding more precise latency info for the
> x86 target?
Free performance? Sure. If you're talking about adding an instruction
itinerary for X86, that should be no problem. A major change to the scheduler
may be harder to swallow. Either way, it would be nice to see the patch and
benchmark data or specific examples of code sequences that improve.
Thanks,
-Andy