thr3ads.net - llvm dev - [LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler [Sep 2012]

If this information is useful, please help other people find it:
Share via:

Ghassan Shobaki

2012-Sep-29 09:43 UTC

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

Hi,

We are currently working on revising a journal article that describes
 our work on pre-allocation scheduling using LLVM and have some questions about
LLVM's pre-allocation scheduler. The answers to these question will help us
better document and analyze the results of our benchmark tests that compare our
algorithm with LLVM's pre-allocation scheduling algorithm.

First, here is a brief description of our work:

We have developed a combinatorial algorithm for balancing instruction-level
parallelism (ILP) and register pressure (RP) during pre-allocation scheduling.
The algorithm is based on a branch-and-bound (B&B) approach, wherein the
objective function is a linear combination of schedule length and register
pressure. We have implemented this algorithm and integrated it into LLVM 2.9 as
an alternate pre-allocation scheduler. Then we compared the performance of our
(B&B) scheduler with that of LLVM's default scheduler on x86 (BURR
scheduler on x86-32 and ILP on x86-64) using SPEC CPU2006. The results show that
our B&B scheduler significantly improves the performance of some benchmarks
relative to LLVM's default scheduler by up to 21%. The geometric-mean
speedup on FP2006 is about 2.4% across the entire suite. We then observed that
LLVM's ILP scheduler on x86-64 uses "rough" latency values. So, we
added the precise latency values published by Agner
 (http://www.agner.org/optimize/) and that led to more speedup relative to
LLVM's ILP scheduler on some benchmarks. The most significant gain from
adding precise latencies was on the gromacs benchmark, which has a high degree
of ILP. I am attaching the benchmarking results on x86-64 using both LLVM's
rough latencies and Agner's precise latencies:

This work makes two points:


-A B&B 
algorithm can discover significantly better schedules than a heuristic 
can do for some larger hard-to-schedule blocks, and if such blocks 
happen to occur in hot code, their scheduling will have a substantial 
impact on 
performance.
- A B&B algorithm is generally slower than a heuristic, but 
it is not a slow as most people think. By applying such an algorithm 
selectively to the hot blocks that are likely to benefit from it and 
setting some compile-time budget, a significant performance gain may be 
achieved with a relatively small increase in compile time.


My questions are:

1. Our current experimental results are based on LLVM 2.9. We 
definitely plan on upgrading to the latest LLVM version in our future 
work, but is there a fundamentally compelling reason for us to upgrade now to
3.1 for the sake of making the above points in the publication?


2. The BURR scheduler on x86-32 appears to set all latencies to one (which makes
it a pure RR scheduler with no ILP), while the ILP scheduler on
x86-64 appears to set all latencies to 10 expect for a few long-latency 
instructions. For the sake of documenting this in the paper, does anyone know
(or can point
me to) a precise description of how the scheduler sets latency values? 
In the revised paper, I will add experimental results based on precise 
latency values (see the attached spreadsheet) and would like to clearly 
document how LLVM's rough latencies for x86 are determined.


3. Was the choice to use rough latency values in the ILP scheduler based 
on the fact that using precise latencies makes it much harder for a 
heuristic non-backtracking scheduler to balance ILP and RP or the choice was
made simply because nobody bothered to write an x86 itinerary?  

4. Does the ILP scheduler ever consider scheduling a stall (leaving a 
cycle empty) when there are ready instructions? Here is a small hypothetical
example that explains what I mean:

Suppose
 that at Cycle C the register pressure (RP) is equal to the physical 
limit and all ready instructions in that cycle start new live ranges, 
thus increasing the RP above the physical register limit. However, in a 
later cycle C+Delta some instruction X that closes a currently open live
 range will become ready. If the objective is minimizing RP, the right
 choice to make in this case is leaving Cycles C through C+Delta-1 empty
 and scheduling Instruction X in Cycle C+Delta. Otherwise, we will be 
increasing the RP. Does the ILP scheduler ever make such a choice or it will 
always schedule an instruction when the ready list is not empty?


Thank you in advance!

-Ghassan  


Ghassan Shobaki
Assistant Professor
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan

Attachments inlined:

Rough Latencies

Benchmark Branch-and-Bound LLVM 
 

 SPEC Score SPEC Score % Score Difference 
400.perlbench 21.2 20.2 4.95% 
401.bzip2 13.9 13.6 2.21% 
403.gcc 19.5 19.8 -1.52% 
429.mcf 20.5 20.5 0.00% 
445.gobmk 18.6 18.6 0.00% 
456.hmmer 11.1 11.1 0.00% 
458.sjeng 19.3 19.3 0.00% 
462.libquantum 39.5 39.5 0.00% 
464.h264ref 28.5 28.5 0.00% 
471.omnetpp 15.6 15.6 0.00% 
473.astar 13 13 0.00% 
483.xalancbmk 21.9 21.9 0.00% 
GEOMEAN 19.0929865 19.00588287     0.46% 
410.bwaves  15.2 15.2 0.00% 
416.gamess CE CE #VALUE! 
433.milc  19 18.6 2.15% 
434.zeusmp    14.2 14.2 0.00% 
435.gromacs       11.6 11.3 2.65% 
436.cactusADM 8.31 7.89 5.32% 
437.leslie3d 11 11 0.00% 
444.namd   16 16 0.00% 
447.dealII 25.4 25.4 0.00% 
450.soplex 26.1 26.1 0.00% 
453.povray 20.5 20.5 0.00% 
454.calculix 8.44 8.3 1.69% 
459.GemsFDTD  10.7 10.7 0.00% 
465.tonto CE CE #VALUE! 
470.lbm 38.1 31.5 20.95% 
481.wrf 11.6 11.6 0.00% 
482.sphinx3 28.2 26.9 4.83% 
GEOMEAN 15.91486307 15.54419555    2.38% 

Precise Latencies

Benchmark Branch-and-Bound LLVM 
 

 SPEC Score SPEC Score % Score Difference 
400.perlbench 21.2 20.2 4.95% 
401.bzip2 13.9 13.6 2.21% 
403.gcc 19.6 19.8 -1.01% 
429.mcf 20.8 20.5 1.46% 
445.gobmk 18.8 18.6 1.08% 
456.hmmer 11.1 11.1 0.00% 
458.sjeng 19.3 19.3 0.00% 
462.libquantum 39.5 39.5 0.00% 
464.h264ref 28.5 28.5 0.00% 
471.omnetpp 15.6 15.6 0.00% 
473.astar 13 13 0.00% 
483.xalancbmk 21.9 21.9 0.00% 
GEOMEAN 19.14131861 19.00588287  
 0.71% 
410.bwaves  15.5 15.2 1.97% 
416.gamess CE CE #VALUE! 
433.milc  19.3 18.6 3.76% 
434.zeusmp    14.2 14.2 0.00% 
435.gromacs       12.4 11.3 9.73% 
436.cactusADM 7.7 7.89 -2.41% 
437.leslie3d 11 11 0.00% 
444.namd   16.2 16 1.25% 
447.dealII 25.4 25.4 0.00% 
450.soplex 26.1 26.1 0.00% 
453.povray 20.5 20.5 0.00% 
454.calculix 8.55 8.3 3.01% 
459.GemsFDTD  10.5 10.7 -1.87% 
465.tonto CE CE #VALUE! 
470.lbm 38.8 31.5 23.17% 
481.wrf 11.6 11.6 0.00% 
482.sphinx3 28 26.9 4.09% 
GEOMEAN 15.96082174 15.54419555     2.68% 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120929/517814e8/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: x86-64_BB_vs_LLVM_roughLatencies.xls
Type: application/vnd.ms-excel
Size: 10240 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120929/517814e8/attachment.xls>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: x86-64_BB_vs_LLVM_preciseLatencies.xls
Type: application/vnd.ms-excel
Size: 10240 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120929/517814e8/attachment-0001.xls>

Duncan Sands

2012-Sep-29 10:36 UTC

head link

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

Hi Ghassan, this is very interesting, however...
> We are currently working on revising a journal article that describes our
work
> on pre-allocation scheduling using LLVM and have some questions about
LLVM's
> pre-allocation scheduler. The answers to these question will help us better
> document and analyze the results of our benchmark tests that compare our
> algorithm with LLVM's pre-allocation scheduling algorithm.
>
> First, here is a brief description of our work:
>
> We have developed a combinatorial algorithm for balancing instruction-level
> parallelism (ILP) and register pressure (RP) during pre-allocation
scheduling.
> The algorithm is based on a branch-and-bound (B&B) approach, wherein
the
> objective function is a linear combination of schedule length and register
> pressure. We have implemented this algorithm and integrated it into LLVM
2.9 as
> an alternate pre-allocation scheduler. Then we compared the performance of
our
> (B&B) scheduler with that of LLVM's default scheduler on x86 (BURR
scheduler on
> x86-32 and ILP on x86-64) using SPEC CPU2006. The results show that our
B&B
> scheduler significantly improves the performance of some benchmarks
relative to
> LLVM's default scheduler by up to 21%.
... are these differences statistically significant?  In my experience SPEC
scores can vary considerably on different runs, so how many runs did you do
and what was the estimated standard deviation?

Best wishes, Duncan.

  The geometric-mean speedup on FP2006 is> about 2.4% across the entire suite. We then observed that LLVM's ILP
scheduler
> on x86-64 uses "rough" latency values. So, we added the precise
latency values
> published by Agner (http://www.agner.org/optimize/) and that led to more
speedup
> relative to LLVM's ILP scheduler on some benchmarks. The most
significant gain
> from adding precise latencies was on the gromacs benchmark, which has a
high
> degree of ILP. I am attaching the benchmarking results on x86-64 using both
> LLVM's rough latencies and Agner's precise latencies:
>
> This work makes two points:
>
> -A B&B algorithm can discover significantly better schedules than a
heuristic
> can do for some larger hard-to-schedule blocks, and if such blocks happen
to
> occur in hot code, their scheduling will have a substantial impact on
performance.
> - A B&B algorithm is generally slower than a heuristic, but it is not a
slow as
> most people think. By applying such an algorithm selectively to the hot
blocks
> that are likely to benefit from it and setting some compile-time budget, a
> significant performance gain may be achieved with a relatively small
increase in
> compile time.
>
> My questions are:
>
> 1. Our current experimental results are based on LLVM 2.9. We definitely
plan on
> upgrading to the latest LLVM version in our future work, but is there a
> fundamentally compelling reason for us to upgrade now to 3.1 for the sake
of
> making the above points in the publication?
>
> 2. The BURR scheduler on x86-32 appears to set all latencies to one (which
makes
> it a pure RR scheduler with no ILP), while the ILP scheduler on x86-64
appears
> to set all latencies to 10 expect for a few long-latency instructions. For
the
> sake of documenting this in the paper, does anyone know (or can point me
to) a
> precise description of how the scheduler sets latency values? In the
revised
> paper, I will add experimental results based on precise latency values (see
the
> attached spreadsheet) and would like to clearly document how LLVM's
rough
> latencies for x86 are determined.
>
> 3. Was the choice to use rough latency values in the ILP scheduler based on
the
> fact that using precise latencies makes it much harder for a heuristic
> non-backtracking scheduler to balance ILP and RP or the choice was made
simply
> because nobody bothered to write an x86 itinerary?
>
> 4. Does the ILP scheduler ever consider scheduling a stall (leaving a cycle
> empty) when there are ready instructions? Here is a small hypothetical
example
> that explains what I mean:
>
> Suppose that at Cycle C the register pressure (RP) is equal to the physical
> limit and all ready instructions in that cycle start new live ranges, thus
> increasing the RP above the physical register limit. However, in a later
cycle
> C+Delta some instruction X that closes a currently open live range will
become
> ready. If the objective is minimizing RP, the right choice to make in this
case
> is leaving Cycles C through C+Delta-1 empty and scheduling Instruction X in
> Cycle C+Delta. Otherwise, we will be increasing the RP. Does the ILP
scheduler
> ever make such a choice or it will always schedule an instruction when the
ready
> list is not empty?
>
> Thank you in advance!
> -Ghassan
>
> Ghassan Shobaki
> Assistant Professor
> Department of Computer Science
> Princess Sumaya University for Technology
> Amman, Jordan
>
> Attachments inlined:
>
> Rough Latencies
>
> Benchmark 	Branch-and-Bound 	LLVM 	
>
> 	SPEC Score 	SPEC Score 	% Score Difference
> 400.perlbench 	21.2 	20.2 	4.95%
> 401.bzip2 	13.9 	13.6 	2.21%
> 403.gcc 	19.5 	19.8 	-1.52%
> 429.mcf 	20.5 	20.5 	0.00%
> 445.gobmk 	18.6 	18.6 	0.00%
> 456.hmmer 	11.1 	11.1 	0.00%
> 458.sjeng 	19.3 	19.3 	0.00%
> 462.libquantum 	39.5 	39.5 	0.00%
> 464.h264ref 	28.5 	28.5 	0.00%
> 471.omnetpp 	15.6 	15.6 	0.00%
> 473.astar 	13 	13 	0.00%
> 483.xalancbmk 	21.9 	21.9 	0.00%
> GEOMEAN 	19.0929865 	19.00588287 	    0.46%
> 410.bwaves	15.2 	15.2 	0.00%
> 416.gamess 	CE 	CE 	#VALUE!
> 433.milc	19 	18.6 	2.15%
> 434.zeusmp	14.2 	14.2 	0.00%
> 435.gromacs	11.6 	11.3 	2.65%
> 436.cactusADM 	8.31 	7.89 	5.32%
> 437.leslie3d 	11 	11 	0.00%
> 444.namd	16 	16 	0.00%
> 447.dealII 	25.4 	25.4 	0.00%
> 450.soplex 	26.1 	26.1 	0.00%
> 453.povray 	20.5 	20.5 	0.00%
> 454.calculix 	8.44 	8.3 	1.69%
> 459.GemsFDTD	10.7 	10.7 	0.00%
> 465.tonto 	CE 	CE 	#VALUE!
> 470.lbm 	38.1 	31.5 	20.95%
> 481.wrf 	11.6 	11.6 	0.00%
> 482.sphinx3 	28.2 	26.9 	4.83%
> GEOMEAN 	15.91486307 	15.54419555 	   2.38%
>
>
> Precise Latencies
>
> Benchmark 	Branch-and-Bound 	LLVM 	
>
> 	SPEC Score 	SPEC Score 	% Score Difference
> 400.perlbench 	21.2 	20.2 	4.95%
> 401.bzip2 	13.9 	13.6 	2.21%
> 403.gcc 	19.6 	19.8 	-1.01%
> 429.mcf 	20.8 	20.5 	1.46%
> 445.gobmk 	18.8 	18.6 	1.08%
> 456.hmmer 	11.1 	11.1 	0.00%
> 458.sjeng 	19.3 	19.3 	0.00%
> 462.libquantum 	39.5 	39.5 	0.00%
> 464.h264ref 	28.5 	28.5 	0.00%
> 471.omnetpp 	15.6 	15.6 	0.00%
> 473.astar 	13 	13 	0.00%
> 483.xalancbmk 	21.9 	21.9 	0.00%
> GEOMEAN 	19.14131861 	19.00588287
> 	0.71%
> 410.bwaves	15.5 	15.2 	1.97%
> 416.gamess 	CE 	CE 	#VALUE!
> 433.milc	19.3 	18.6 	3.76%
> 434.zeusmp	14.2 	14.2 	0.00%
> 435.gromacs	12.4 	11.3 	9.73%
> 436.cactusADM 	7.7 	7.89 	-2.41%
> 437.leslie3d 	11 	11 	0.00%
> 444.namd	16.2 	16 	1.25%
> 447.dealII 	25.4 	25.4 	0.00%
> 450.soplex 	26.1 	26.1 	0.00%
> 453.povray 	20.5 	20.5 	0.00%
> 454.calculix 	8.55 	8.3 	3.01%
> 459.GemsFDTD	10.5 	10.7 	-1.87%
> 465.tonto 	CE 	CE 	#VALUE!
> 470.lbm 	38.8 	31.5 	23.17%
> 481.wrf 	11.6 	11.6 	0.00%
> 482.sphinx3 	28 	26.9 	4.09%
> GEOMEAN 	15.96082174 	15.54419555 	    2.68%
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

Evan Cheng

2012-Sep-29 18:37 UTC

head link

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

On Sep 29, 2012, at 2:43 AM, Ghassan Shobaki <ghassan_shobaki at
yahoo.com> wrote:
> Hi,
> 
> We are currently working on revising a journal article that describes our
work on pre-allocation scheduling using LLVM and have some questions about
LLVM's pre-allocation scheduler. The answers to these question will help us
better document and analyze the results of our benchmark tests that compare our
algorithm with LLVM's pre-allocation scheduling algorithm.
> 
> First, here is a brief description of our work:
> 
> We have developed a combinatorial algorithm for balancing instruction-level
parallelism (ILP) and register pressure (RP) during pre-allocation scheduling.
The algorithm is based on a branch-and-bound (B&B) approach, wherein the
objective function is a linear combination of schedule length and register
pressure. We have implemented this algorithm and integrated it into LLVM 2.9 as
an alternate pre-allocation scheduler. Then we compared the performance of our
(B&B) scheduler with that of LLVM's default scheduler on x86 (BURR
scheduler on x86-32 and ILP on x86-64) using SPEC CPU2006. The results show that
our B&B scheduler significantly improves the performance of some benchmarks
relative to LLVM's default scheduler by up to 21%. The geometric-mean
speedup on FP2006 is about 2.4% across the entire suite. We then observed that
LLVM's ILP scheduler on x86-64 uses "rough" latency values. So, we
added the precise latency values published by Agner
(http://www.agner.org/optimize/) and that led to more speedup relative to
LLVM's ILP scheduler on some benchmarks. The most significant gain from
adding precise latencies was on the gromacs benchmark, which has a high degree
of ILP. I am attaching the benchmarking results on x86-64 using both LLVM's
rough latencies and Agner's precise latencies:
> 
> This work makes two points:
> 
> -A B&B algorithm can discover significantly better schedules than a
heuristic can do for some larger hard-to-schedule blocks, and if such blocks
happen to occur in hot code, their scheduling will have a substantial impact on
performance.
> - A B&B algorithm is generally slower than a heuristic, but it is not a
slow as most people think. By applying such an algorithm selectively to the hot
blocks that are likely to benefit from it and setting some compile-time budget,
a significant performance gain may be achieved with a relatively small increase
in compile time.
> 
> My questions are:
> 
> 1. Our current experimental results are based on LLVM 2.9. We definitely
plan on upgrading to the latest LLVM version in our future work, but is there a
fundamentally compelling reason for us to upgrade now to 3.1 for the sake of
making the above points in the publication?
Yes there is. While the pre-allocation scheduler has not had algorithmic changes
during the past year it has received minor tweaks which can impact performance.
Also note the scheduler is on its way out. I don't know when the article
will be published. But it's possible by the time the paper is published, you
would be essentially comparing against deprecated technology.

> 
> 2. The BURR scheduler on x86-32 appears to set all latencies to one (which
makes it a pure RR scheduler with no ILP), while the ILP scheduler on x86-64
appears to set all latencies to 10 expect for a few long-latency instructions.
For the sake of documenting this in the paper, does anyone know (or can point me
to) a precise description of how the scheduler sets latency values? In the
revised paper, I will add experimental results based on precise latency values
(see the attached spreadsheet) and would like to clearly document how LLVM's
rough latencies for x86 are determined.
I don't think your information is correct. The ILP scheduler is not setting
the latencies to 10. LLVM does not have machine models for x86 (except for atom)
so it's using a uniform latency model (one cycle).
> 
> 3. Was the choice to use rough latency values in the ILP scheduler based on
the fact that using precise latencies makes it much harder for a heuristic
non-backtracking scheduler to balance ILP and RP or the choice was made simply
because nobody bothered to write an x86 itinerary?
No one has bothered to write the itinerary.
> 
> 4. Does the ILP scheduler ever consider scheduling a stall (leaving a cycle
empty) when there are ready instructions? Here is a small hypothetical example
that explains what I mean:
> 
> Suppose that at Cycle C the register pressure (RP) is equal to the physical
limit and all ready instructions in that cycle start new live ranges, thus
increasing the RP above the physical register limit. However, in a later cycle
C+Delta some instruction X that closes a currently open live range will become
ready. If the objective is minimizing RP, the right choice to make in this case
is leaving Cycles C through C+Delta-1 empty and scheduling Instruction X in
Cycle C+Delta. Otherwise, we will be increasing the RP. Does the ILP scheduler
ever make such a choice or it will always schedule an instruction when the ready
list is not empty?
I don't believe so.

Evan
> 
> Thank you in advance!
> -Ghassan  
> 
> Ghassan Shobaki
> Assistant Professor
> Department of Computer Science
> Princess Sumaya University for Technology
> Amman, Jordan
> 
> Attachments inlined:
> 
> Rough Latencies
> 
> Benchmark	Branch-and-Bound	LLVM	
> 
> SPEC Score	SPEC Score	% Score Difference
> 400.perlbench	21.2	20.2	4.95%
> 401.bzip2	13.9	13.6	2.21%
> 403.gcc	19.5	19.8	-1.52%
> 429.mcf	20.5	20.5	0.00%
> 445.gobmk	18.6	18.6	0.00%
> 456.hmmer	11.1	11.1	0.00%
> 458.sjeng	19.3	19.3	0.00%
> 462.libquantum	39.5	39.5	0.00%
> 464.h264ref	28.5	28.5	0.00%
> 471.omnetpp	15.6	15.6	0.00%
> 473.astar	13	13	0.00%
> 483.xalancbmk	21.9	21.9	0.00%
> GEOMEAN	19.0929865	19.00588287	    0.46%
> 410.bwaves 	15.2	15.2	0.00%
> 416.gamess	CE	CE	#VALUE!
> 433.milc 	19	18.6	2.15%
> 434.zeusmp   	14.2	14.2	0.00%
> 435.gromacs      	11.6	11.3	2.65%
> 436.cactusADM	8.31	7.89	5.32%
> 437.leslie3d	11	11	0.00%
> 444.namd  	16	16	0.00%
> 447.dealII	25.4	25.4	0.00%
> 450.soplex	26.1	26.1	0.00%
> 453.povray	20.5	20.5	0.00%
> 454.calculix	8.44	8.3	1.69%
> 459.GemsFDTD 	10.7	10.7	0.00%
> 465.tonto	CE	CE	#VALUE!
> 470.lbm	38.1	31.5	20.95%
> 481.wrf	11.6	11.6	0.00%
> 482.sphinx3	28.2	26.9	4.83%
> GEOMEAN	15.91486307	15.54419555	   2.38%
> 
> Precise Latencies
> 
> Benchmark	Branch-and-Bound	LLVM	
> 
> SPEC Score	SPEC Score	% Score Difference
> 400.perlbench	21.2	20.2	4.95%
> 401.bzip2	13.9	13.6	2.21%
> 403.gcc	19.6	19.8	-1.01%
> 429.mcf	20.8	20.5	1.46%
> 445.gobmk	18.8	18.6	1.08%
> 456.hmmer	11.1	11.1	0.00%
> 458.sjeng	19.3	19.3	0.00%
> 462.libquantum	39.5	39.5	0.00%
> 464.h264ref	28.5	28.5	0.00%
> 471.omnetpp	15.6	15.6	0.00%
> 473.astar	13	13	0.00%
> 483.xalancbmk	21.9	21.9	0.00%
> GEOMEAN	19.14131861	19.00588287  
> 0.71%
> 410.bwaves 	15.5	15.2	1.97%
> 416.gamess	CE	CE	#VALUE!
> 433.milc 	19.3	18.6	3.76%
> 434.zeusmp   	14.2	14.2	0.00%
> 435.gromacs      	12.4	11.3	9.73%
> 436.cactusADM	7.7	7.89	-2.41%
> 437.leslie3d	11	11	0.00%
> 444.namd  	16.2	16	1.25%
> 447.dealII	25.4	25.4	0.00%
> 450.soplex	26.1	26.1	0.00%
> 453.povray	20.5	20.5	0.00%
> 454.calculix	8.55	8.3	3.01%
> 459.GemsFDTD 	10.5	10.7	-1.87%
> 465.tonto	CE	CE	#VALUE!
> 470.lbm	38.8	31.5	23.17%
> 481.wrf	11.6	11.6	0.00%
> 482.sphinx3	28	26.9	4.09%
> GEOMEAN	15.96082174	15.54419555	    2.68%
> 
>
<x86-64_BB_vs_LLVM_roughLatencies.xls><x86-64_BB_vs_LLVM_preciseLatencies.xls>_______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120929/35009f1e/attachment.html>

Ghassan Shobaki

2012-Sep-29 21:46 UTC

head link

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

Duncan,

Yes, there is a significant random variance among runs for some (but not all)
SPEC benchmarks (examples are libquantum, bwaves and cactus). However, we run a
complete SPEC test with three or five iterations after every significant change
we make to the code and make sure that we reproduce all previously measured
differences. If we can't reproduce some previously seen result, that flags a
bug in our latest changes that we trace and fix. For the production test that
was used to generate the results for the paper, we ran SPEC using 9 iterations
(as documented in the paper). Since we have been working on this for over a
year, making multiple significant changes every week, most of the differences
that are reported here have been reproduced tens if not hundreds of times. Any
difference that cannot be reproduced many times is reported as a zero
difference, hence the many zero differences in the tables.

Furthermore, we have analyzed most of the benchmarks on which significant
differences has been measured (lbm, gromacs, cactus, sphinx, etc) and identified
the cause of the performance difference in each case. In most cases, the cause
is a reduction in register pressure in some hot basic block that causes a
significant reduction in spill code, as reported by the register allocator. For
example, on the lbm benchmark, using LLVM's scheduler causes the register
allocator to spill 12 virtual registers in the hottest function that amounts  to
99% of the execution time, while using the branch-and-bound scheduler causes the
register allocator to spill only 2 registers in that hot function.


So, we are quite certain that the reported differences are real and
reproducible.

Evan,

Please see my inlined answers below:

Thanks
-Ghassan




________________________________
 From: Evan Cheng <evan.cheng at apple.com>
To: Ghassan Shobaki <ghassan_shobaki at yahoo.com> 
Cc: "llvmdev at cs.uiuc.edu" <llvmdev at cs.uiuc.edu> 
Sent: Saturday, September 29, 2012 9:37 PM
Subject: Re: [LLVMdev] LLVM's Pre-allocation Scheduler Tested against a
Branch-and-Bound Scheduler
 



On Sep 29, 2012, at 2:43 AM, Ghassan Shobaki <ghassan_shobaki at
yahoo.com> wrote:

Hi,>
>We are currently working on revising a journal article that describes our work on pre-allocation scheduling using LLVM and have some questions about
LLVM's pre-allocation scheduler. The answers to these question will help us
better document and analyze the results of our benchmark tests that compare our
algorithm with LLVM's pre-allocation scheduling
algorithm.>
>First, here is a brief description of our work:
>
>We have developed a combinatorial algorithm for balancing instruction-level
parallelism (ILP) and register pressure (RP) during pre-allocation scheduling.
The algorithm is based on a branch-and-bound (B&B) approach, wherein the
objective function is a linear combination of schedule length and register
pressure. We have implemented this algorithm and integrated it into LLVM 2.9 as
an alternate pre-allocation scheduler. Then we compared the performance of our
(B&B) scheduler with that of LLVM's default scheduler on x86 (BURR
scheduler on x86-32 and ILP on x86-64) using SPEC CPU2006. The results show that our B&B scheduler significantly improves the performance of some
benchmarks relative to LLVM's default scheduler by up to 21%. The
geometric-mean speedup on FP2006 is about 2.4% across the entire suite. We then
observed that LLVM's ILP scheduler on x86-64 uses "rough" latency
values. So, we added the precise latency values published by Agner
(http://www.agner.org/optimize/) and that led to more speedup relative to
LLVM's ILP scheduler on some benchmarks. The most significant gain from
adding precise latencies was on the gromacs benchmark, which has a high degree
of ILP. I am attaching the benchmarking results on x86-64 using both LLVM's
rough latencies and Agner's precise latencies:>
>This work makes two points:
>
>
>-A B&B algorithm can discover significantly better schedules than a heuristic 
can do for some larger hard-to-schedule blocks, and if such blocks 
happen to occur in hot code, their scheduling will have a substantial 
impact on 
performance.>- A B&B algorithm is generally slower than a heuristic, but it is not a slow as most people think. By applying such an algorithm 
selectively to the hot blocks that are likely to benefit from it and 
setting some compile-time budget, a significant performance gain may be 
achieved with a relatively small increase in compile
time.>
>
>
>My questions are:
>
>
>1. Our current experimental results are based on LLVM 2.9. We definitely plan on upgrading to the latest LLVM version in our future 
work, but is there a fundamentally compelling reason for us to upgrade now to
3.1 for the sake of making the above points in the publication?
>Yes there is. While the pre-allocation scheduler has not had algorithmic changes
during the past year it has received minor tweaks which can impact performance.
Also note the scheduler is on its way out. I don't know when the article
will be published. But it's possible by the time the paper is published, you
would be essentially comparing against deprecated technology.

Ghassan: Are there any benchmarking results that quantify the impact of these
tweaks?
I remember reading on this list that the current scheduler will be replaced by a
new scheduler, but will the new scheduler be using any new algorithms for
reducing register pressure and balancing register pressure and ILP? I mean some
new algorithmic technique that is fundamentally different from the current
greedy heuristic approach that uses the Sethi-Ullman number as a priority scheme
for register pressure and the critical-path distance as a priority scheme for
ILP and makes a greedy choice between the two schemes depending on whether the
current register pressure is above or below a certain threshold. If yes, can you
point me to some LLVM documents or academic publications that describe this new
algorithm?

>
>2. The BURR scheduler on x86-32 appears to set all latencies to one (which
makes it a pure RR scheduler with no ILP), while the ILP scheduler onx86-64 appears to set all latencies to 10 expect for a few long-latency 
instructions. For the sake of documenting this in the paper, does anyone know
(or can point
me to) a precise description of how the scheduler sets latency values? 
In the revised paper, I will add experimental results based on precise 
latency values (see the attached spreadsheet) and would like to clearly 
document how LLVM's rough latencies for x86 are
determined.>I don't think your information is correct. The ILP scheduler is not setting
the latencies to 10. LLVM does not have machine models for x86 (except for atom)
so it's using a uniform latency model (one cycle).

Ghassan: I know that LLVM does not have a machine model for x86, but as far as
latency is concerned, setting all latencies to one means totally ignoring ILP
and scheduling for the sole objective of reducing register pressure. That's
what the BURR scheduler does on x86-32, but on x86-64 the default scheduler is
the ILP scheduler, which tries to do some minimal ILP scheduling. I am 100% sure
that this ILP scheduler sets some latency values to 10, because I actually print
the latency values and look at them. I have not attempted to do a thorough study
that identifies the latency value used for each instruction, but apparently most
latencies are set to one except a few long-latency instructions such as DIV. I
am looking for a more precise documentation of how the ILP scheduler sets these
latencies.



>
>3. Was the choice to use rough latency values in the ILP scheduler based on the fact that using precise latencies makes it much harder for a 
heuristic non-backtracking scheduler to balance ILP and RP or the choice was
made simply because nobody bothered to write an x86 itinerary?  
No one has bothered to write the itinerary.


>
>4. Does the ILP scheduler ever consider scheduling a stall (leaving a cycle empty) when there are ready instructions? Here is a small hypothetical
example that explains what I mean:>
>Suppose that at Cycle C the register pressure (RP) is equal to the physical 
limit and all ready instructions in that cycle start new live ranges, 
thus increasing the RP above the physical register limit. However, in a 
later cycle C+Delta some instruction X that closes a currently open live
 range will become ready. If the objective is minimizing RP, the right
 choice to make in this case is leaving Cycles C through C+Delta-1 empty
 and scheduling Instruction X in Cycle C+Delta. Otherwise, we will be 
increasing the RP. Does the ILP scheduler ever make such a choice or it will 
always schedule an instruction when the ready list is not
empty?>I don't believe so.
Ghassan: That can lead to significant increases in register pressure for basic
blocks having many long-latency instructions, do you agree? Do you know if the
new scheduling algorithm is going to resolve this problem? That's a very
hard problem to solve.


Evan


>Thank you in advance!
>
>-Ghassan  
>
>
>
>Ghassan Shobaki
>Assistant Professor
>Department of Computer Science
>Princess Sumaya University for Technology
>Amman, Jordan
>
>
>Attachments inlined:
>
>
>Rough Latencies
>
>
>Benchmark Branch-and-Bound LLVM 
> 
>
> SPEC Score SPEC Score % Score Difference 
>400.perlbench 21.2 20.2 4.95% 
>401.bzip2 13.9 13.6 2.21% 
>403.gcc 19.5 19.8 -1.52% 
>429.mcf 20.5 20.5 0.00% 
>445.gobmk 18.6 18.6 0.00% 
>456.hmmer 11.1 11.1 0.00% 
>458.sjeng 19.3 19.3 0.00% 
>462.libquantum 39.5 39.5 0.00% 
>464.h264ref 28.5 28.5 0.00% 
>471.omnetpp 15.6 15.6 0.00% 
>473.astar 13 13 0.00% 
>483.xalancbmk 21.9 21.9 0.00% 
>GEOMEAN 19.0929865 19.00588287     0.46% 
>410.bwaves  15.2 15.2 0.00% 
>416.gamess CE CE #VALUE! 
>433.milc  19 18.6 2.15% 
>434.zeusmp    14.2 14.2 0.00% 
>435.gromacs       11.6 11.3 2.65% 
>436.cactusADM 8.31 7.89 5.32% 
>437.leslie3d 11 11 0.00% 
>444.namd   16 16 0.00% 
>447.dealII 25.4 25.4 0.00% 
>450.soplex 26.1 26.1 0.00% 
>453.povray 20.5 20.5 0.00% 
>454.calculix 8.44 8.3 1.69% 
>459.GemsFDTD  10.7 10.7 0.00% 
>465.tonto CE CE #VALUE! 
>470.lbm 38.1 31.5 20.95% 
>481.wrf 11.6 11.6 0.00% 
>482.sphinx3 28.2 26.9 4.83% 
>GEOMEAN 15.91486307 15.54419555    2.38% 
>
>
>Precise Latencies
>
>
>Benchmark Branch-and-Bound LLVM 
> 
>
> SPEC Score SPEC Score % Score Difference 
>400.perlbench 21.2 20.2 4.95% 
>401.bzip2 13.9 13.6 2.21% 
>403.gcc 19.6 19.8 -1.01% 
>429.mcf 20.8 20.5 1.46% 
>445.gobmk 18.8 18.6 1.08% 
>456.hmmer 11.1 11.1 0.00% 
>458.sjeng 19.3 19.3 0.00% 
>462.libquantum 39.5 39.5 0.00% 
>464.h264ref 28.5 28.5 0.00% 
>471.omnetpp 15.6 15.6 0.00% 
>473.astar 13 13 0.00% 
>483.xalancbmk 21.9 21.9 0.00% 
>GEOMEAN 19.14131861 19.00588287  
> 0.71% 
>410.bwaves  15.5 15.2 1.97% 
>416.gamess CE CE #VALUE! 
>433.milc  19.3 18.6 3.76% 
>434.zeusmp    14.2 14.2 0.00% 
>435.gromacs       12.4 11.3 9.73% 
>436.cactusADM 7.7 7.89 -2.41% 
>437.leslie3d 11 11 0.00% 
>444.namd   16.2 16 1.25% 
>447.dealII 25.4 25.4 0.00% 
>450.soplex 26.1 26.1 0.00% 
>453.povray 20.5 20.5 0.00% 
>454.calculix 8.55 8.3 3.01% 
>459.GemsFDTD  10.5 10.7 -1.87% 
>465.tonto CE CE #VALUE! 
>470.lbm 38.8 31.5 23.17% 
>481.wrf 11.6 11.6 0.00% 
>482.sphinx3 28 26.9 4.09% 
>GEOMEAN 15.96082174 15.54419555     2.68% 
>
><x86-64_BB_vs_LLVM_roughLatencies.xls><x86-64_BB_vs_LLVM_preciseLatencies.xls>_______________________________________________
>LLVM Developers mailing list
>LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120929/65feb547/attachment.html>

Rao, Prashantha

2012-Oct-01 19:18 UTC

head link

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

Ghassan-

LLVM version 3.0 onwards has better register allocator, greedy register
allocator at higher optimization levels. This might handle the not-so-great
schedules better than the register allocation used in version 2.9.

-Prashantha

From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On
Behalf Of Ghassan Shobaki
Sent: Saturday, September 29, 2012 3:13 PM
To: llvmdev at cs.uiuc.edu
Subject: [LLVMdev] LLVM's Pre-allocation Scheduler Tested against a
Branch-and-Bound Scheduler

Hi,

We are currently working on revising a journal article that describes our work
on pre-allocation scheduling using LLVM and have some questions about LLVM's
pre-allocation scheduler. The answers to these question will help us better
document and analyze the results of our benchmark tests that compare our
algorithm with LLVM's pre-allocation scheduling algorithm.

First, here is a brief description of our work:

We have developed a combinatorial algorithm for balancing instruction-level
parallelism (ILP) and register pressure (RP) during pre-allocation scheduling.
The algorithm is based on a branch-and-bound (B&B) approach, wherein the
objective function is a linear combination of schedule length and register
pressure. We have implemented this algorithm and integrated it into LLVM 2.9 as
an alternate pre-allocation scheduler. Then we compared the performance of our
(B&B) scheduler with that of LLVM's default scheduler on x86 (BURR
scheduler on x86-32 and ILP on x86-64) using SPEC CPU2006. The results show that
our B&B scheduler significantly improves the performance of some benchmarks
relative to LLVM's default scheduler by up to 21%. The geometric-mean
speedup on FP2006 is about 2.4% across the entire suite. We then observed that
LLVM's ILP scheduler on x86-64 uses "rough" latency values. So, we
added the precise latency values published by Agner
(http://www.agner.org/optimize/) and that led to more speedup relative to
LLVM's ILP scheduler on some benchmarks. The most significant gain from
adding precise latencies was on the gromacs benchmark, which has a high degree
of ILP. I am attaching the benchmarking results on x86-64 using both LLVM's
rough latencies and Agner's precise latencies:

This work makes two points:
-A B&B algorithm can discover significantly better schedules than a
heuristic can do for some larger hard-to-schedule blocks, and if such blocks
happen to occur in hot code, their scheduling will have a substantial impact on
performance.
- A B&B algorithm is generally slower than a heuristic, but it is not a slow
as most people think. By applying such an algorithm selectively to the hot
blocks that are likely to benefit from it and setting some compile-time budget,
a significant performance gain may be achieved with a relatively small increase
in compile time.

My questions are:

1. Our current experimental results are based on LLVM 2.9. We definitely plan on
upgrading to the latest LLVM version in our future work, but is there a
fundamentally compelling reason for us to upgrade now to 3.1 for the sake of
making the above points in the publication?

2. The BURR scheduler on x86-32 appears to set all latencies to one (which makes
it a pure RR scheduler with no ILP), while the ILP scheduler on x86-64 appears
to set all latencies to 10 expect for a few long-latency instructions. For the
sake of documenting this in the paper, does anyone know (or can point me to) a
precise description of how the scheduler sets latency values? In the revised
paper, I will add experimental results based on precise latency values (see the
attached spreadsheet) and would like to clearly document how LLVM's rough
latencies for x86 are determined.

3. Was the choice to use rough latency values in the ILP scheduler based on the
fact that using precise latencies makes it much harder for a heuristic
non-backtracking scheduler to balance ILP and RP or the choice was made simply
because nobody bothered to write an x86 itinerary?

4. Does the ILP scheduler ever consider scheduling a stall (leaving a cycle
empty) when there are ready instructions? Here is a small hypothetical example
that explains what I mean:

Suppose that at Cycle C the register pressure (RP) is equal to the physical
limit and all ready instructions in that cycle start new live ranges, thus
increasing the RP above the physical register limit. However, in a later cycle
C+Delta some instruction X that closes a currently open live range will become
ready. If the objective is minimizing RP, the right choice to make in this case
is leaving Cycles C through C+Delta-1 empty and scheduling Instruction X in
Cycle C+Delta. Otherwise, we will be increasing the RP. Does the ILP scheduler
ever make such a choice or it will always schedule an instruction when the ready
list is not empty?

Thank you in advance!
-Ghassan

Ghassan Shobaki
Assistant Professor
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan

Attachments inlined:

Rough Latencies

Benchmark

Branch-and-Bound

LLVM


SPEC Score

SPEC Score

% Score Difference

400.perlbench

21.2

20.2

4.95%

401.bzip2

13.9

13.6

2.21%

403.gcc

19.5

19.8

-1.52%

429.mcf

20.5

20.5

0.00%

445.gobmk

18.6

18.6

0.00%

456.hmmer

11.1

11.1

0.00%

458.sjeng

19.3

19.3

0.00%

462.libquantum

39.5

39.5

0.00%

464.h264ref

28.5

28.5

0.00%

471.omnetpp

15.6

15.6

0.00%

473.astar

13

13

0.00%

483.xalancbmk

21.9

21.9

0.00%

GEOMEAN

19.0929865

19.00588287

    0.46%

410.bwaves

15.2

15.2

0.00%

416.gamess

CE

CE

#VALUE!

433.milc

19

18.6

2.15%

434.zeusmp

14.2

14.2

0.00%

435.gromacs

11.6

11.3

2.65%

436.cactusADM

8.31

7.89

5.32%

437.leslie3d

11

11

0.00%

444.namd

16

16

0.00%

447.dealII

25.4

25.4

0.00%

450.soplex

26.1

26.1

0.00%

453.povray

20.5

20.5

0.00%

454.calculix

8.44

8.3

1.69%

459.GemsFDTD

10.7

10.7

0.00%

465.tonto

CE

CE

#VALUE!

470.lbm

38.1

31.5

20.95%

481.wrf

11.6

11.6

0.00%

482.sphinx3

28.2

26.9

4.83%

GEOMEAN

15.91486307

15.54419555

   2.38%


Precise Latencies

Benchmark

Branch-and-Bound

LLVM


SPEC Score

SPEC Score

% Score Difference

400.perlbench

21.2

20.2

4.95%

401.bzip2

13.9

13.6

2.21%

403.gcc

19.6

19.8

-1.01%

429.mcf

20.8

20.5

1.46%

445.gobmk

18.8

18.6

1.08%

456.hmmer

11.1

11.1

0.00%

458.sjeng

19.3

19.3

0.00%

462.libquantum

39.5

39.5

0.00%

464.h264ref

28.5

28.5

0.00%

471.omnetpp

15.6

15.6

0.00%

473.astar

13

13

0.00%

483.xalancbmk

21.9

21.9

0.00%

GEOMEAN

19.14131861

19.00588287

0.71%

410.bwaves

15.5

15.2

1.97%

416.gamess

CE

CE

#VALUE!

433.milc

19.3

18.6

3.76%

434.zeusmp

14.2

14.2

0.00%

435.gromacs

12.4

11.3

9.73%

436.cactusADM

7.7

7.89

-2.41%

437.leslie3d

11

11

0.00%

444.namd

16.2

16

1.25%

447.dealII

25.4

25.4

0.00%

450.soplex

26.1

26.1

0.00%

453.povray

20.5

20.5

0.00%

454.calculix

8.55

8.3

3.01%

459.GemsFDTD

10.5

10.7

-1.87%

465.tonto

CE

CE

#VALUE!

470.lbm

38.8

31.5

23.17%

481.wrf

11.6

11.6

0.00%

482.sphinx3

28

26.9

4.09%

GEOMEAN

15.96082174

15.54419555

    2.68%


-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20121001/d6bb6d29/attachment.html>

Andrew Trick

2012-Oct-05 05:31 UTC

head link

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

On Sep 29, 2012, at 11:37 AM, Evan Cheng <evan.cheng at apple.com> wrote:
> 
> On Sep 29, 2012, at 2:43 AM, Ghassan Shobaki <ghassan_shobaki at
yahoo.com> wrote:
> 
>> Hi,
>> 
>> We are currently working on revising a journal article that describes
our work on pre-allocation scheduling using LLVM and have some questions about
LLVM's pre-allocation scheduler. The answers to these question will help us
better document and analyze the results of our benchmark tests that compare our
algorithm with LLVM's pre-allocation scheduling algorithm.
>> 
>> First, here is a brief description of our work:
>> 
>> We have developed a combinatorial algorithm for balancing
instruction-level parallelism (ILP) and register pressure (RP) during
pre-allocation scheduling. The algorithm is based on a branch-and-bound
(B&B) approach, wherein the objective function is a linear combination of
schedule length and register pressure. We have implemented this algorithm and
integrated it into LLVM 2.9 as an alternate pre-allocation scheduler. Then we
compared the performance of our (B&B) scheduler with that of LLVM's
default scheduler on x86 (BURR scheduler on x86-32 and ILP on x86-64) using SPEC
CPU2006. The results show that our B&B scheduler significantly improves the
performance of some benchmarks relative to LLVM's default scheduler by up to
21%. The geometric-mean speedup on FP2006 is about 2.4% across the entire suite.
We then observed that LLVM's ILP scheduler on x86-64 uses "rough"
latency values. So, we added the precise latency values published by Agner
(http://www.agner.org/optimize/) and that led to more speedup relative to
LLVM's ILP scheduler on some benchmarks. The most significant gain from
adding precise latencies was on the gromacs benchmark, which has a high degree
of ILP. I am attaching the benchmarking results on x86-64 using both LLVM's
rough latencies and Agner's precise latencies:
>> 
>> This work makes two points:
>> 
>> -A B&B algorithm can discover significantly better schedules than a
heuristic can do for some larger hard-to-schedule blocks, and if such blocks
happen to occur in hot code, their scheduling will have a substantial impact on
performance.
>> - A B&B algorithm is generally slower than a heuristic, but it is
not a slow as most people think. By applying such an algorithm selectively to
the hot blocks that are likely to benefit from it and setting some compile-time
budget, a significant performance gain may be achieved with a relatively small
increase in compile time.
>> 
>> My questions are:
>> 
>> 1. Our current experimental results are based on LLVM 2.9. We
definitely plan on upgrading to the latest LLVM version in our future work, but
is there a fundamentally compelling reason for us to upgrade now to 3.1 for the
sake of making the above points in the publication?
> 
> Yes there is. While the pre-allocation scheduler has not had algorithmic
changes during the past year it has received minor tweaks which can impact
performance. Also note the scheduler is on its way out. I don't know when
the article will be published. But it's possible by the time the paper is
published, you would be essentially comparing against deprecated technology.

Ghassan,

Evan is right. Your speedups from enhancing the scheduler's latency are
probably different with 3.1. But that doesn't really change your two main
points that (1) you can further reduce register pressure and (2) your compile
time isn't horrible. I think the major heuristic changes in the PreRA
scheduler went in before llvm 2.9. OTOH, Prasantha makes a good point that the
new register allocator may diminish the impact of a bad schedule.
At the very least, I would check the code being generated by 3.1 for the few
benchmarks that  interest you to see how much the baseline changed.

Another thing that you may want to experiment with...

I implemented a register pressure tracking scheduler earlier this year. You can
run it yourself on x86-64 with the flag -enable-misched. The current
implementation is a platform for experimentation only. I haven't invested
any time developing heuristics, which will need to work across a range of
interesting benchmarks and subtargets. Others have done some performance
analysis with the framework, and may be able to chime in on it's good points
or weak points.

This is not based on Sethi-Ullman in any way. There are a number of interesting
heuristics that I can think of implementing, but will not be doing so until I
can justify adding the cost and complexity.

The current (experimental) MI scheduler implements a 3-level
"back-off":

1) Respect the target's register limits at all times.

2) Indentify critical register classes (pressure sets) before scheduling.
   Track pressure within the currently scheduled region.
   Avoid increasing scheduled pressure for critical registers.

3) Avoid exceeding the max pressure of the region prior to scheduling (don't
make things locally worse).

Think of these crude heuristics as baseline. This is what can be done easily and
cheaply. Anything more sophisticated has to surpass this low bar.

All of the heuristics that I have planned are greedy, and none are
sophisticated, but some require precomputing register lineages
(dependence chains that reuse a single register). An interesting twist is that
the MI scheduler can alternate between top-up and bottom-down,
which doesn't fundamentally change the problem, but avoids the common cases
in which greedy schedulers "get stuck".

Here's my plan for the near future:

- SpillCost: Map register units onto a spill cost that is more meaningful for
heuristics.

- Pressure Query: (compile time) Redesign the pressure tracker to summarize
information at the instruction level for fast queries during scheduling.

- Pressure Range: Before scheduling, compute the high pressure region as a range
of instructions. If the scheduler is not currently under pressure, prioritize
instructions from within the range.

- Register Lineages: Before scheduling, use a heuristic to select desirable
lineages. Select the longest lineage from the queue. After scheduling an
instruction, look at the next instruction in the lineage. If it has an
unscheduled operand, mark that operand's lineage as pending, and priortize
the head of that lineage. This solves some interesting cases where a greedy
scheduler is normally unable to choose among a set of identical looking
instructions by knowing how their dependence chain relates to any already
scheduled instructions.

-Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20121004/8ed68dd4/attachment.html>

Andrew Trick

2012-Oct-05 05:42 UTC

head link

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

On Sep 29, 2012, at 11:37 AM, Evan Cheng <evan.cheng at apple.com> wrote:
>> 
>> 2. The BURR scheduler on x86-32 appears to set all latencies to one
(which makes it a pure RR scheduler with no ILP), while the ILP scheduler on
x86-64 appears to set all latencies to 10 expect for a few long-latency
instructions. For the sake of documenting this in the paper, does anyone know
(or can point me to) a precise description of how the scheduler sets latency
values? In the revised paper, I will add experimental results based on precise
latency values (see the attached spreadsheet) and would like to clearly document
how LLVM's rough latencies for x86 are determined.
> 
> I don't think your information is correct. The ILP scheduler is not
setting the latencies to 10. LLVM does not have machine models for x86 (except
for atom) so it's using a uniform latency model (one cycle).
Evan's description is precise. Everything is one cycle, unless it is 10
cycles ;) But it's easy to reconfigure to use itineraries, as I guess
you've done.
>> 
>> 3. Was the choice to use rough latency values in the ILP scheduler
based on the fact that using precise latencies makes it much harder for a
heuristic non-backtracking scheduler to balance ILP and RP or the choice was
made simply because nobody bothered to write an x86 itinerary?
> No one has bothered to write the itinerary.
I recently committed infrastructure that allows machine models to be developed
incrementally, at the level of detail appropriate for the processor. I have a
feeling we will start to see models begin to evolve for x86 processors very
soon. The framework is documented in TargetSchedule.td and you're welcome to
contribute.

The only feature that I still plan to implement is the ability of a machine
model to specify that it is derived from another. It will be trivial to add
though when the time comes.
>> 4. Does the ILP scheduler ever consider scheduling a stall (leaving a
cycle empty) when there are ready instructions? Here is a small hypothetical
example that explains what I mean:
>> 
>> Suppose that at Cycle C the register pressure (RP) is equal to the
physical limit and all ready instructions in that cycle start new live ranges,
thus increasing the RP above the physical register limit. However, in a later
cycle C+Delta some instruction X that closes a currently open live range will
become ready. If the objective is minimizing RP, the right choice to make in
this case is leaving Cycles C through C+Delta-1 empty and scheduling Instruction
X in Cycle C+Delta. Otherwise, we will be increasing the RP. Does the ILP
scheduler ever make such a choice or it will always schedule an instruction when
the ready list is not empty?
The standard ILP scheduler does not have a "ReadyFilter" so
instructions are inserted in the ready queue the moment their predecessors are
scheduled. So, yes, it will effectively "impose stalls" to reduce
register pressure. Note that things work differently with an itinerary though.
And the answer will depend on how you've written the itinerary.

-Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20121004/9fff170c/attachment.html>

Rafael Espíndola

2012-Oct-07 21:53 UTC

head link

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

> ... are these differences statistically significant?  In my experience SPEC
> scores can vary considerably on different runs, so how many runs did you do
> and what was the estimated standard deviation?
It is also subject to measurement bias. You probably want to look at

http://www.inf.usi.ch/faculty/hauswirth/publications/asplos09.pdf
> Best wishes, Duncan.
Cheers,
Rafael

Seemingly Similar Threads

Search for more apparently analagous threads

llvm dev - Sep 2012 - [LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

Seemingly Similar Threads