thr3ads.net - llvm dev - [LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3 [Sep 2013]

Ghassan Shobaki

2013-Jul-02 21:35 UTC

[LLVMdev] MI Scheduler vs SD Scheduler?

Thank you for the answers! We are currently trying to test the MI scheduler. We
are using LLVM 3.3 with Dragon Egg 3.3 on an x86-64 machine. So far, we have run
one SPEC CPU2006 test with the MI scheduler enabled using the option
-fplugin-arg-dragonegg-llvm-option='-enable-misched:true' with -O3. This
enables the machine scheduler in addition to the SD scheduler. We have verified
this by adding print messages to the source code of both schedulers. In terms of
correctness, enabling the MI scheduler did not cause any failure. However, in
terms of performance, we have seen a mix of small positive and negative
differences with the geometric mean difference being near zero. The maximum
improvement that we have seen is 3% on the Gromacs benchmark.  Is this
consistent with your test results?

We have then tried to run a test in which the MI scheduler is enabled but the SD
scheduler is disabled (or neutralized) by adding the option:
-fplugin-arg-dragonegg-llvm-option='-pre-RA-sched:source' to the flags
that we have used in the first test. However, this did not work; we got the
following error message:

GCC_4.6.4_DIR/install/bin/gcc -c -o lbm.o -DSPEC_CPU -DNDEBUG    -O3
-march=core2 -mtune=core2 -fplugin='DRAGON_EGG_DIR/dragonegg.so'
-fplugin-arg-dragonegg-llvm-option='-enable-misched:true'
-fplugin-arg-dragonegg-llvm-option='-pre-RA-sched:source'      
-DSPEC_CPU_LP64         lbm.c
cc1: for the -pre-RA-sched option: may only occur zero or one times!
specmake: *** [lbm.o] Error 1

What does this message mean? 

Is this a bug, or are we doing something wrong?

How can we test the MI scheduler by itself? 

Is it interesting to test 3.3, or are there interesting features that were added
to the trunk after branching 3.3? In the latter case, we are willing to test the
trunk.

Thanks

Ghassan Shobaki

Assistant Professor 

Department of Computer Science 

Princess Sumaya University for Technology 

Amman, Jordan

________________________________
 From: Andrew Trick <atrick at apple.com>
To: Ghassan Shobaki <ghassan_shobaki at yahoo.com> 
Cc: "llvmdev at cs.uiuc.edu" <llvmdev at cs.uiuc.edu> 
Sent: Monday, July 1, 2013 8:10 PM
Subject: Re: MI Scheduler vs SD Scheduler?

Sent from my iPhone

On Jun 28, 2013, at 2:38 PM, Ghassan Shobaki <ghassan_shobaki at
yahoo.com> wrote:

> Hi,
>
> We are currently in the process of upgrading from LLVM 2.9 to LLVM 3.3. We
> are working on instruction scheduling (mainly for register pressure
> reduction). I have been following the llvmdev mailing list and have learned
> that a machine instruction (MI) scheduler has been implemented to replace (or
> work with?) the selection DAG (SD) scheduler. However, I could not find any
> document that describes the new MI scheduler and how it differs from and
> relates to the SD scheduler.

MI is now the place to implement any heuristics for profitable scheduling. SD
scheduler will be directly replaced by a new pass that orders the DAG as close
as it can to IR order. We currently emulate this with -pre-RA-sched=source.

The only thing necessarily different about MI sched is that it runs after reg
coalescing and before reg alloc, and maintains live interval analysis. As a
result, register pressure tracking is more accurate. It also uses a new target
interface for precise register pressure.

MI sched is intended to be a convenient place to implement target specific
scheduling. There is a generic implementation that uses standard heuristics to
reduce register pressure and balance latency and CPU resources. That is what you
currently get when you enable MI sched for x86.

The generic heuristics are implemented as a priority function that makes a
greedy choice over the ready instructions based on the current pressure and the
resources and latency of the scheduled and unscheduled set of instructions.

A DAG subtree analysis also exists (ScheduleDFS), which can be used for
register pressure avoidance. This isn't hooked up to the generic heuristics
yet for lack of interesting test cases.
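
The "greedy choice over the ready instructions" described above can be illustrated with a toy list scheduler. This is only a minimal sketch under stated assumptions, not LLVM's implementation: the `Node` type, the single `pressure_delta` number per instruction, the `pressure_limit` threshold and the tie-breaking order are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one instruction in a scheduling DAG."""
    name: str
    preds: tuple = ()        # names of instructions that must be scheduled first
    latency: int = 1         # assumed cycles until the result is available
    pressure_delta: int = 0  # assumed net change in live registers when scheduled


def greedy_schedule(nodes, pressure_limit):
    """Top-down greedy list scheduling over an acyclic dependence graph.

    Below the pressure limit, prefer the instruction with the longest
    latency path to the end of the block. At or above the limit, prefer
    the instruction that lowers register pressure the most.
    """
    by_name = {n.name: n for n in nodes}
    order = {n.name: i for i, n in enumerate(nodes)}
    succs = {n.name: [] for n in nodes}
    for n in nodes:
        for p in n.preds:
            succs[p].append(n.name)

    # Height = own latency plus the longest latency path through successors.
    height = {}

    def compute_height(name):
        if name not in height:
            below = max((compute_height(s) for s in succs[name]), default=0)
            height[name] = by_name[name].latency + below
        return height[name]

    for n in nodes:
        compute_height(n.name)

    scheduled, done, pressure = [], set(), 0
    while len(scheduled) < len(nodes):
        ready = [n for n in nodes
                 if n.name not in done and all(p in done for p in n.preds)]
        if pressure >= pressure_limit:
            # Pressure is the bottleneck: reduce it first, then use latency.
            key = lambda n: (n.pressure_delta, -height[n.name], order[n.name])
        else:
            # Latency is the bottleneck: follow the critical path first.
            key = lambda n: (-height[n.name], n.pressure_delta, order[n.name])
        best = min(ready, key=key)
        scheduled.append(best.name)
        done.add(best.name)
        pressure += best.pressure_delta
    return scheduled


# Three loads feeding two adds: (a + b) + c.
example = [
    Node("a", latency=3, pressure_delta=1),
    Node("b", latency=3, pressure_delta=1),
    Node("c", latency=3, pressure_delta=1),
    Node("add1", preds=("a", "b"), pressure_delta=-1),
    Node("add2", preds=("add1", "c"), pressure_delta=-1),
]
```

With a generous limit (`greedy_schedule(example, 10)`) all three loads are issued first to hide latency. With a limit of 2, `add1` is pulled ahead of the load of `c`, so fewer values are live at once.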

> So, I would appreciate any pointer to a document (or a blog) that may help us
> understand the difference and the relation between the two schedulers and
> figure out how to deal with them. We are trying to answer the following
> questions:
>
> - A comment at the top of the file ScheduleDAGInstrs says that this file
> implements re-scheduling of machine instructions. So, what does re-scheduling
> mean?

Rescheduling just means optional scheduling. That's really what the comment
should say. It's important to know that MI sched can be skipped for faster
compilation.

> Does it mean that the real scheduling algorithms (such as reg pressure
> reduction) are currently implemented in the SD scheduler, while the MI
> scheduler does some kind of complementary work (fine tuning) at a lower level
> representation of the code?
> And what's the future plan? Is it to move the real scheduling algorithms into
> the MI scheduler and get rid of the SD scheduler? Will that happen in 3.4 or
> later?

I would like to get rid of the SD scheduler so we can reduce compile time by
streamlining the scheduling data structures and interfaces. There may be some
objection to doing that in 3.4 if projects haven't been able to migrate. It
will be deprecated though.

> - Based on our initial investigation of the default behavior at -O3 on
> x86-64, it appears that the SD scheduler is called while the MI scheduler is
> not. That's consistent with the above interpretation of re-scheduling, but
> I'd appreciate any advice on what we should do at this point. Should we
> integrate our work (an alternate register pressure reduction scheduler) into
> the SD scheduler or the MI scheduler?
>
Please refer to my recent messages on llvmdev regarding enabling MI scheduling
by default on x86. 
http://article.gmane.org/gmane.comp.compilers.llvm.devel/63242/match=machinescheduler

I suggest integrating with the MachineScheduler pass.
There are many places to plug in. MachineSchedRegistry provides the hook. At
that point you can define your own ScheduleDAGInstrs or ScheduleDAGMI subclass.
People who only want to define new heuristics should reuse ScheduleDAGMI
directly and only define their own MachineSchedStrategy.

> - Our SPEC testing on x86-64 has shown a significant performance improvement
> of LLVM 3.3 relative to LLVM 2.9 (about 5% in geomean on INT2006 and 15% in
> geomean on FP2006), but our spill code measurements have shown that LLVM 3.3
> generates significantly more spill code on most benchmarks. We will be doing
> more investigation on this, but are there any known facts that explain this
> behavior? Is this caused by a known regression in scheduling and/or
> allocation (which I doubt) or by the implementation (or enabling) of some new
> optimization(s) that naturally increase(s) register pressure?

There is not a particular known regression. It's not surprising that
optimizations increase pressure.

Andy

> Thank you in advance!
>
> Ghassan Shobaki
> Assistant Professor
> Department of Computer Science
> Princess Sumaya University for Technology
> Amman, Jordan

Andrew Trick

2013-Jul-12 07:58 UTC

[LLVMdev] MI Scheduler vs SD Scheduler?

On Jul 2, 2013, at 2:35 PM, Ghassan Shobaki <ghassan_shobaki at yahoo.com>
wrote:
> Thank you for the answers! We are currently trying to test the MI
scheduler. We are using LLVM 3.3 with Dragon Egg 3.3 on an x86-64 machine. So
far, we have run one SPEC CPU2006 test with the MI scheduler enabled using the
option -fplugin-arg-dragonegg-llvm-option='-enable-misched:true' with
-O3. This enables the machine scheduler in addition to the SD scheduler. We have
verified this by adding print messages to the source code of both schedulers. In
terms of correctness, enabling the MI scheduler did not cause any failure.
However, in terms of performance, we have seen a mix of small positive and
negative differences with the geometric mean difference being near zero. The
maximum improvement that we have seen is 3% on the Gromacs benchmark.  Is this
consistent with your test results?
I haven’t benchmarked Fortran. On x86-64, I regularly see wild swings in
performance, 10-20% for small codegen changes (small benchmarks with a primary
hot loop). This is not a natural consequence of scheduling, unless spill code
changed in the hot loop (rare on x86-64). Quite often, a somewhat random change
in copy coalescing results in different register allocation and code layout. The
results are chaotic and very platform (linker) and microarchitecture specific.
Large benchmarks are immune to wild swings, but the small changes you see could
just be the accumulation of chaotic behavior of individual loops. It’s hard for
me to draw conclusions without looking at hardware counters and isolating the
data to individual loops.

The MI scheduler’s generic heuristics are much more about avoiding worst-case
scheduling in pathological situations (very large unrolled loops) than they are
about tuning for a microarchitecture. People who want to do that may want to
plug in their own scheduling strategy. The precise machine model and register
pressure information are all there now.

The broadest statement I can make is that we should not unnecessarily spill
within loops (with rare exceptions). If you see that, file a bug. I know there
are still situations that we don’t handle well, but haven’t had a compelling
enough reason to add the complexity to the generic heuristics. If good test
cases come in, then I’ll do that.
> We have then tried to run a test in which the MI scheduler is enabled but
the SD scheduler is disabled (or neutralized) by adding the option:
-fplugin-arg-dragonegg-llvm-option='-pre-RA-sched:source' to the flags
that we have used in the first test. However, this did not work; we got the
following error message:
> 
> GCC_4.6.4_DIR/install/bin/gcc -c -o lbm.o -DSPEC_CPU -DNDEBUG    -O3
-march=core2 -mtune=core2 -fplugin='DRAGON_EGG_DIR/dragonegg.so'
-fplugin-arg-dragonegg-llvm-option='-enable-misched:true'
-fplugin-arg-dragonegg-llvm-option='-pre-RA-sched:source'      
-DSPEC_CPU_LP64         lbm.c
> cc1: for the -pre-RA-sched option: may only occur zero or one times!
> specmake: *** [lbm.o] Error 1
> 
> What does this message mean? 
> Is this a bug, or are we doing something wrong?
I’m not sure why the driver is telling you this. Maybe someone familiar with
dragonegg can help?

You can always rebuild llvm with the enableMachineScheduler() hook implemented.
http://article.gmane.org/gmane.comp.compilers.llvm.devel/63242/match=machinescheduler

Then -enable-misched=true/false simply toggles MI Sched without changing
anything else.
> How can we test the MI scheduler by itself? 
> Is it interesting to test 3.3, or are there interesting features that were
added to the trunk after branching 3.3? In the latter case, we are willing to
test the trunk.
It doesn’t look like my June checkins made it into 3.3. If you’re enabling MI
Sched, and actually evaluating performance of the default heuristics, then it’s
best to use trunk.

-Andy
> 
> Thanks
> 
> Ghassan Shobaki
> Assistant Professor 
> Department of Computer Science 
> Princess Sumaya University for Technology 
> Amman, Jordan
> 
> 
> From: Andrew Trick <atrick at apple.com>
> To: Ghassan Shobaki <ghassan_shobaki at yahoo.com> 
> Cc: "llvmdev at cs.uiuc.edu" <llvmdev at cs.uiuc.edu> 
> Sent: Monday, July 1, 2013 8:10 PM
> Subject: Re: MI Scheduler vs SD Scheduler?
> 
> 
> Sent from my iPhone
> 
> On Jun 28, 2013, at 2:38 PM, Ghassan Shobaki <ghassan_shobaki at
yahoo.com> wrote:
> 
>> Hi,
>> 
>> We are currently in the process of upgrading from LLVM 2.9 to LLVM 3.3.
We are working on instruction scheduling (mainly for register pressure
reduction). I have been following the llvmdev mailing list and have learned that
a machine instruction (MI) scheduler has been implemented to replace (or work
with?) the selection DAG (SD) scheduler. However, I could not find any document
that describes the new MI scheduler and how it differs from and relates to the
SD scheduler.
> 
> MI is now the place to implement any heuristics for profitable scheduling.
SD scheduler will be directly replaced by a new pass that orders the DAG as
close as it can to IR order. We currently emulate this with
-pre-RA-sched=source.
> The only thing necessarily different about MI sched is that it runs after
reg coalescing and before reg alloc, and maintains live interval analysis. As a
result, register pressure tracking is more accurate. It also uses a new target
interface for precise register pressure.
> MI sched  is intended to be a convenient place to implement target specific
scheduling. There is a generic implementation that uses standard heuristics to
reduce register pressure and balance latency and CPU resources. That is what you
currently get when you enable MI sched for x86.
> The generic heuristics are implemented as a priority function that makes a
greedy choice over the ready instructions based on the current pressure and the
resources and latency of the scheduled and unscheduled set of instructions.
> A DAG subtree analysis also exists (ScheduleDFS), which can be used for
register pressure avoidance. This isn't hooked up to the generic heuristics
yet for lack of interesting test cases.
> 
>> So, I would appreciate any pointer to a document (or a blog) that may
help us understand the difference and the relation between the two schedulers
and figure out how to deal with them. We are trying to answer the following
questions:
>> 
>> - A comment at the top of the file ScheduleDAGInstrs says that this
file implements re-scheduling of machine instructions. So, what does
re-scheduling mean?
> 
> Rescheduling just means optional scheduling. That's really what the
comment should say. It's important to know that MI sched can be skipped for
faster compilation.
> 
>> Does it mean that the real scheduling algorithms (such as reg pressure
reduction) are currently implemented in the SD scheduler, while the MI scheduler
does some kind of complementary work (fine tuning) at a lower level
representation of the code?
>> And what's the future plan? Is it to move the real scheduling
algorithms into the MI scheduler and get rid of the SD scheduler? Will that
happen in 3.4 or later?
> 
> I would like to get rid of the SD scheduler so we can reduce compile time
by streamlining the scheduling data structures and interfaces. There may be some
objection to doing that in 3.4 if projects haven't been able to migrate. It
will be deprecated though.
> 
>> 
>> - Based on our initial investigation of the default behavior at -O3 on
x86-64, it appears that the SD scheduler is called while the MI scheduler is
not. That's consistent with the above interpretation of re-scheduling, but
I'd appreciate any advice on what we should do at this point. Should we
integrate our work (an alternate register pressure reduction scheduler) into the
SD scheduler or the MI scheduler?
> 
> Please refer to my recent messages on llvmdev regarding enabling MI
scheduling by default on x86.
>
http://article.gmane.org/gmane.comp.compilers.llvm.devel/63242/match=machinescheduler
> 
> I suggest integrating with the MachineScheduler pass.
> There are many places to plug in. MachineSchedRegistry provides the hook.
At that point you can define your own ScheduleDAGInstrs or ScheduleDAGMI
subclass. People who only want to define new heuristics should reuse
ScheduleDAGMI directly and only define their own MachineSchedStrategy.
> 
>> 
>> - Our SPEC testing on x86-64 has shown a significant performance
improvement of LLVM 3.3 relative to LLVM 2.9 (about 5% in geomean on INT2006 and
15% in geomean on FP2006), but our spill code measurements have shown that LLVM
3.3 generates significantly more spill code on most benchmarks. We will be doing
more investigation on this, but are there any known facts that explain this
behavior? Is this caused by a known regression in scheduling and/or allocation
(which I doubt) or by the implementation (or enabling) of some new
optimization(s) that naturally increase(s) register pressure?
>> 
> There is not a particular known regression. It's not surprising that
optimizations increase pressure.
> 
> Andy
> 
>> Thank you in advance!
>> 
>> Ghassan Shobaki
>> Assistant Professor 
>> Department of Computer Science 
>> Princess Sumaya University for Technology 
>> Amman, Jordan

Ghassan Shobaki

2013-Sep-17 18:04 UTC

[LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3

Hi Andy,

We have done some experimental evaluation of the different schedulers in LLVM
3.3 (source, BURR, ILP, fast, MI). The evaluation was done on x86-64 using SPEC
CPU2006. We have measured both the amount of spill code and the execution time,
as detailed below.

Here are our main findings:

1. The SD schedulers significantly impact the spill counts and the execution
times for many benchmarks, but the machine instruction (MI) scheduler in 3.3
has very limited impact on both spill counts and execution times. Is this
because most of your work on MI did not make it into the 3.3 release? We
don't have a strong motivation to test the trunk at this point (we'll wait for
3.4), because we are working on a publication and prefer to base that on an
official release. However, if you tell me that you expect things to be
significantly different in the trunk, we'll try to find the time to give that a
shot (unfortunately, we only have one test machine, and SPEC tests take a lot of
time as detailed below).


2. The BURR scheduler gives the minimum amount of spill code
and the best overall execution time (SPEC geo-mean).

3. The source scheduler is the second best scheduler in terms of spill code and
execution time, and its performance is very close to that of BURR in both
metrics. This result is surprising to me because, as far as I understand, this
scheduler is a conservative scheduler that tries to preserve the original
program order. Is that right? Does this result surprise you?

4. The ILP scheduler has the worst execution times on FP2006 and the second
worst spill counts, although it is the default on x86-64. Is this surprising?
BTW, Dragon Egg sets the scheduler to source. On Line 368 in Backend.cpp, we
find:
if (!flag_schedule_insns)
    Args.push_back("--pre-RA-sched=source");   
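
If Dragon Egg already passes --pre-RA-sched=source on its own, that may explain the earlier "may only occur zero or one times" error when the same option was also supplied by hand. The following is a toy model of a once-only option check, not LLVM's command-line parser: the function name `parse_once_only` and the assumption that the arguments have already been normalized to the `name=value` form are hypothetical.

```python
def parse_once_only(args, once_only=("pre-RA-sched",)):
    """Parse '-name=value' or '--name=value' arguments into a dict.

    Options listed in once_only may appear at most one time, mimicking
    the behaviour reported in the error message quoted in this thread.
    """
    seen = {}
    for arg in args:
        name, _, value = arg.lstrip("-").partition("=")
        if name in once_only and name in seen:
            raise ValueError(
                f"for the -{name} option: may only occur zero or one times!")
        seen[name] = value
    return seen


# Argument assumed to be added by the plugin when flag_schedule_insns is off.
plugin_args = ["--pre-RA-sched=source"]

# Arguments supplied by the user in the failing build.
user_args = ["-enable-misched=true", "-pre-RA-sched=source"]

try:
    parse_once_only(plugin_args + user_args)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Under this model, passing only -enable-misched by hand succeeds, because the scheduler option then occurs once.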

Here are the details of our results:

Spill Counts
---------------
CPU2006 has a total of 47448 functions, out of which 10363 functions (22%) have
spills. If we break this down by FP and INT, we’ll see that 42% of the functions
in FP2006 have spills, while 10% of the functions in INT2006 have spills. The
amount of spill code was measured by printing the number of ranges spilled by
the default (greedy) register allocator (printing the variable NumSpilledRanges
in InlineSpiller.cpp). This is not a perfectly accurate metric, but, given the
large sample size (> 10K functions), the total number of spills across such a
statistically significant sample is believed to give a very strong indication
about each scheduler's performance at reducing register pressure. The
differences in the table below are calculated relative to the source scheduler.
 
Heuristic     Total Spills   Source Spills   Spill Difference   % Spill Difference
Source              294471          294471                  0                0.00%
ILP                 298222          294471               3751                1.27%
BURR                287932          294471              -6539               -2.22%
Fast                312787          294471              18316                6.22%
source + MI         294979          294471                508                0.17%
ILP + MI            296681          294471               2210                0.75%
BURR + MI           289328          294471              -5143               -1.75%
Fast + MI           302131          294471               7660                2.60%

So, the best register pressure reduction scheduler is BURR. Note that enabling
the MI scheduler makes things better when the SD scheduler is relatively weak
at reducing register pressure (fast or ILP), while it makes things worse when
the SD scheduler is relatively good at reducing register pressure (BURR or
source).
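
The differences in the spill table, and the effect of adding the MI scheduler on top of each SD scheduler, can be recomputed from the reported totals. This is a small sketch for checking the arithmetic; the helper names `spill_diff` and `mi_effect` are made up here, and only the totals come from the table.

```python
# Total spill counts as reported in the table above.
TOTALS = {
    "source": 294471,
    "ILP": 298222,
    "BURR": 287932,
    "fast": 312787,
    "source + MI": 294979,
    "ILP + MI": 296681,
    "BURR + MI": 289328,
    "fast + MI": 302131,
}


def spill_diff(heuristic, baseline="source"):
    """Return (absolute difference, percent difference) versus the baseline."""
    diff = TOTALS[heuristic] - TOTALS[baseline]
    return diff, round(100.0 * diff / TOTALS[baseline], 2)


def mi_effect(sd_scheduler):
    """Change in total spills when MI is enabled on top of an SD scheduler."""
    return TOTALS[sd_scheduler + " + MI"] - TOTALS[sd_scheduler]


for sd in ("source", "BURR", "ILP", "fast"):
    print(f"{sd:>6}: MI changes total spills by {mi_effect(sd):+d}")
```

Enabling MI lowers the totals for fast and ILP (by 10656 and 1541 spills) and raises them for BURR and source (by 1396 and 508), which is the pattern described in the paragraph above.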


Execution Times
---------------------
Execution times were measured by running the benchmarks on an x86-64 machine
with 5 or 9 iterations per benchmark as indicated below (in most cases, no
significant difference was observed between 9 iterations (which take about two
days) and 5 iterations (which take about one day)). The differences in the table
below are calculated relative to the source scheduler. The %Diff Max (Min) is
the maximum (minimum) percentage difference on a single benchmark between each
scheduler and the source scheduler. These numbers show that the differences on
individual FP benchmarks can be quite significant.


Heuristic     FP %Diff   FP %Diff   FP %Diff   INT %Diff   INT %Diff   INT %Diff   Iterations
              Geo-mean   Max        Min        Geo-mean    Max         Min
source           0.00%      0.00%      0.00%       0.00%       0.00%       0.00%
ILP             -2.02%      2.30%    -22.04%       0.42%       3.61%      -2.16%   9 iterations
BURR             0.70%      8.62%     -5.56%       0.66%       3.09%      -1.40%   9 iterations
fast            -1.34%      9.48%     -6.72%       0.12%       3.09%      -2.34%   5 iterations
source + MI      0.21%      3.42%     -1.26%      -0.01%       0.83%      -0.94%   5 iterations
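
The message does not state the exact formula behind the %Diff columns. One plausible reading, sketched below purely as an assumption, treats each per-benchmark difference as a speedup ratio against the source scheduler (negative means slower) and takes the geometric mean of those ratios. The function names and the sample times in the comments are hypothetical and are not measurements from this thread.

```python
import math


def pct_diff(base_time, new_time):
    """Percent difference of one benchmark; negative means new_time is slower."""
    return 100.0 * (base_time / new_time - 1.0)


def geomean_pct_diff(base_times, new_times):
    """Geometric mean of per-benchmark speedup ratios, expressed as a percent."""
    ratios = [b / n for b, n in zip(base_times, new_times)]
    geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return 100.0 * (geomean - 1.0)


def max_min_pct_diff(base_times, new_times):
    """Largest and smallest per-benchmark percent difference."""
    diffs = [pct_diff(b, n) for b, n in zip(base_times, new_times)]
    return max(diffs), min(diffs)
```

For instance, with made-up times where one benchmark halves its run time and another doubles it, the two ratios are 2 and 0.5 and the geometric mean difference is 0%, even though the Max and Min columns would show +100% and -50%. This is why a near-zero geo-mean can hide large swings on individual benchmarks.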
 
Most of the above performance differences have been correlated with significant
changes in spill counts in hot functions. Note that the ILP scheduler causes a
degradation of 22% on one benchmark (lbm) relative to the source scheduler. We
have verified that this happens because of poor scheduling that increases the
register pressure and thus leads to generating excessive spills in this
benchmark’s hottest loop. We should probably report this as a performance bug if
ILP stays the default scheduler on x86-64.

Regards

Ghassan Shobaki
Assistant Professor 
Department of Computer Science 
Princess Sumaya University for Technology
Amman, Jordan
