On Mar 24, 2008, at 10:47 PM, Christopher Lamb wrote:
> I know that this has been discussed (at least in passing) a few
> times on the list, but I couldn't locate a bug for it. Have any
> architectural plans been made for it?
Funny you bring this up. Evan and I were tossing around crazy ideas
about this just today. If you're interested, maybe we should get
together for lunch or coffee or something.
> Are there architectural roadblocks with the current LLVM
> infrastructure that will complicate the process? What has
> demotivated the implementation of this so far (is it not that big a
> benefit on important targets, too much time/effort for the payoff)?
There is a huge payoff to doing this or even making progress on it.
It is an elegant solution to many issues. There are two current
problems:
1) There are some scalability issues (in compile time) with large
selectiondags. Roman has been tackling this recently, so this will
hopefully be a non-issue really soon. In any case, it would be very
reasonable to continue doing single-MBB selectiondags when in "-fast"
mode (i.e. -O0) and do whole function ones when compile time is worth
it (-O2+ etc).
2) The bigger issue is one of scheduling. The great thing about whole
function isel is that nothing up to scheduling needs to care about
what block something is in (as long as chain nodes pin side-effecting
instrs to their block of course). Things like machine-code loop
invariant code motion, CSE of values from different blocks, sinking
and many other details are easily handled, and directly in the
selection dags: this is really cool! :) The problem is that
*something* has to decide what block to put all the operations in. The
most logical thing to do here is to have a pass, either part of sched
or right before it, that assigns a block to each node (which is a
target instr node) and then have sched remain a single-bb at a time
(for now, it could obviously be improved later to work on traces etc).
The problem is that in full generality, this is a problem very similar
to PRE: assigning blocks to operations optimally is complex and
somewhat subtle. An excellent place to start looking at this is Cliff
Click's thesis, which is available online (and incidentally inspired a
lot of selection dag design). In his thesis, he has a hacky solution
that is probably good enough, certainly to start out with.
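For concreteness, here is a hedged C++ sketch of the kind of placement heuristic Click describes (the types and names are invented, not LLVM code): schedule-early and schedule-late give each node a legal range along a dominator-tree path, and you pick the block on that path with the shallowest loop nesting.

```cpp
// Hypothetical sketch of block placement a la Click's thesis.
// 'idom' is the immediate dominator; loopDepth is loop nesting depth.
struct Block {
  Block *idom = nullptr;
  unsigned loopDepth = 0;
};

// 'early' must dominate 'late'.  Walk the dominator path from late up
// to early and keep the shallowest block, preferring the later block
// on ties (so values stay close to their uses).
Block *pickBlock(Block *early, Block *late) {
  Block *best = late;
  for (Block *b = late; ; b = b->idom) {
    if (b->loopDepth < best->loopDepth)
      best = b;
    if (b == early)
      break;
  }
  return best;
}
```

This is a sketch of the heuristic only, not of how it would plug into sched; the real pass would compute 'early' from a node's operands and 'late' from the LCA of its uses.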
The reason this hasn't been done so far is that we (people at apple)
haven't been able to justify the "big project" of tackling this when
there are still tons of little projects that are lower-hanging fruit
for our
needs. We will almost certainly do this someday (and would have
already done it if it were blocking some project of ours), but would
love to have someone work on it in the shorter term. The thing that
Evan and I were discussing today is that there are a variety of ways
to get intermediate results without tackling the full generality of
the problem:
1) One annoying problem for targets like PPC and ARM (which promote i8/
i16 to i32) can be solved with a really simple hack. Consider a
switch on i16 that codegens to a branch tree. This turns into a bunch
of small blocks: the first typically zext's the value and does an
unsigned comparison against the whole range, and each subsequent block
does an unsigned comparison against its subportion of the range. The
problem is that right now, the subsequent comparisons just see that
the i16 value is promoted, which means the top bits are unknown. This
means that each block ends up doing tons of redundant "and"s to clear
out the top bits before their comparison. This is obviously a special
case of a general problem: uses of values in blocks that aren't the
def can't see properties of the value.
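To make the shape concrete, here is a minimal C++ example of the kind of switch that hits this (the function is illustrative, not from any real code): on a target that promotes i16 to i32, a branch-tree lowering of this switch puts each comparison in its own block, and without cross-block known-bits info each block must re-"and" the promoted value with 0xFFFF even though the zext at the top already cleared those bits.

```cpp
// Illustrative only: a switch on a promoted type (i16 on a 32-bit
// target).  Lowered as a tree of unsigned comparisons, each comparison
// lands in its own small block; every block sees only the promoted i32
// value, so the top 16 bits look "unknown" and get masked redundantly.
int classify(unsigned short x) {
  switch (x) {
  case 1:   return 10;
  case 2:   return 20;
  case 100: return 30;
  case 200: return 40;
  default:  return 0;
  }
}
```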
This (and similar problems) boils down to not knowing info about a
cross-block promoted value. This sort of thing can be solved with a
very simple hack by adding a few maps to SDISel. The idea is that, in
ISel, each cross-block CopyToReg for a Vreg can call ComputeMaskedBits
and ComputeSignBits and record the known live out bit+sign information
for the vreg. Later, if someone calls ComputeMaskedBits or
ComputeSignBits and it recurses up to a CopyFromReg from the vreg, the
information can be provided from its definition.
This elegantly and cheaply solves the problem. The reason I call it a
hack is that it means you get different code for a function based on
the order that blocks are selected (i.e. it is better to compile the
defs of values before the uses), which is annoying, but actually
probably just fine in practice. Codegen remains deterministic of
course.
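A hedged sketch of the maps in question (the class and method names here are invented, not the actual SDISel API): the def block records what it knows about a vreg's bits at CopyToReg time, and later blocks query that when ComputeMaskedBits recurses up to a CopyFromReg.

```cpp
#include <cstdint>
#include <map>

// Simplified stand-in for the known-bits result ComputeMaskedBits
// produces: which bits are known zero and which are known one.
struct KnownBits {
  uint64_t Zero = 0;
  uint64_t One  = 0;
};

// Hypothetical side table added to SDISel.
class CrossBlockInfo {
  std::map<unsigned, KnownBits> LiveOutInfo;  // vreg -> known bits
public:
  // Called at each cross-block CopyToReg, in the block defining VReg.
  void recordLiveOut(unsigned VReg, KnownBits KB) { LiveOutInfo[VReg] = KB; }

  // Called when a known-bits query recurses to a CopyFromReg of VReg.
  // If the def's block hasn't been selected yet, we conservatively
  // report nothing known -- which is exactly why selection order
  // affects (but never breaks) the generated code.
  KnownBits query(unsigned VReg) const {
    auto It = LiveOutInfo.find(VReg);
    return It == LiveOutInfo.end() ? KnownBits{} : It->second;
  }
};
```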
2) An intermediate step to whole function isel is to do region based
isel with regions limited by what makes the sched problem simple. One
very interesting class of region is a Single-Entry-Multiple-Exit
region. It would be straightforward to break each function into SEME
regions and select them one at a time, allowing them to compile to
multiple MBBs.
In addition to the obvious dag combiner and isel improvements that
would come from larger regions, SEME regions encapsulate a number of
interesting things (such as switch lowering). Having SEME regions
would allow us to remove the block juggling in SDISel: switches would
just lower to multiple blocks in one dag. SEME regions are also
totally trivial for the sched problem: because your region is just a
tree, all you have to do is move a computation "up" the tree towards
the root to the place that dominates all its uses. Another
interesting thing about SEME regions is that you don't have to worry
about phi nodes etc.
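Because a SEME region's blocks form a tree, "the place that dominates all its uses" is just the lowest common ancestor of the using blocks. A small sketch of that hoisting step, with invented types (not LLVM code):

```cpp
#include <vector>

// Blocks of a SEME region form a tree rooted at the region entry, so
// each block just needs a parent pointer and its depth in the tree.
struct RBlock {
  RBlock *parent = nullptr;
  unsigned depth = 0;
};

// The block a computation should be hoisted to: the lowest common
// ancestor of all blocks that use it, found by lifting the deeper
// block until the two walks meet.
RBlock *dominatingBlock(std::vector<RBlock *> uses) {
  RBlock *a = uses.front();
  for (RBlock *b : uses) {
    while (a->depth > b->depth) a = a->parent;
    while (b->depth > a->depth) b = b->parent;
    while (a != b) { a = a->parent; b = b->parent; }
  }
  return a;
}
```

The real sched pass would of course operate on SDNodes rather than a toy tree, but the placement question for SEME regions really is this simple.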
OTOH, SEME regions are still limiting, so we would eventually want to
do whole function (or at least "arbitrary region") isel. SEME regions
do handle "sinking" well, but don't handle LICM (because you can't
have a whole loop in the region) and don't handle the diamond pattern
that "select" lowers to when a target doesn't support cmov. Wouldn't
it be nice if legalize could lower select directly for targets (by
inserting blocks into the DAG), instead of having to do the "custom
sched inserter" hack? :) Wouldn't it be nice if dag combine could
actually itself form cmovs from diamond shapes in the cfg? :)
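For readers unfamiliar with the pattern: the diamond a cmov-less target gets for "select" looks like this when written back as source-level C++ (illustrative only):

```cpp
// select(cond, tval, fval) on a target without cmov: control flow
// splits on the condition and rejoins at a phi -- a two-armed diamond.
int selectLowered(bool cond, int tval, int fval) {
  int result;       // the phi's storage at the join
  if (cond)         // branch block
    result = tval;  // true arm
  else
    result = fval;  // false arm
  return result;    // join block: phi(true arm, false arm)
}
```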
At the end of the day, attacking a subclass of the "big problem" is a
really interesting way (to me) to incrementally tackle this problem,
and I strongly encourage anyone interested in this to start with SEME
regions first. This allows us to figure out things like "how do we
represent the start of a bb" (probably a new chained node), "how does
the sched pass set the block that an instruction goes in" (probably a
new field in SDNode), and allows us to do some cleanups (e.g. switch
lowering) without tackling some of the other hard problems. Once SEME
regions are working really well, we can tackle the rest of the hard
problems, because they are obviously worthwhile doing eventually.
> In toying with some code generators I've been frustrated in getting
> the code gen prepare pass + isel to implement the heuristics I need
> to match some addressing modes and I believe that whole-function
> isel might be the answer.
I would love to see any progress in this area. It is clearly an
important thing to tackle, and it is blocking other interesting
improvements in the code generator. It would also allow us to
eliminate a significant amount of weirdness that exists to hack around
this (e.g. switch lowering). That said, even when we *can* do whole
function isel, I think we should keep the ability to handle subregions
of the function for -O0, at least until SD stuff scales perfectly :).
-Chris