On Mar 24, 2008, at 10:47 PM, Christopher Lamb wrote:
> I know that this has been discussed (at least in passing) a few
> times on the list, but I couldn't locate a bug for it. Have any
> architectural plans been made for it?
Funny you bring this up. Evan and I were tossing around crazy ideas
about this just today. If you're interested, maybe we should get
together for lunch or coffee or something.
> Are there architectural roadblocks with the current LLVM
> infrastructure that will complicate the process? What has
> demotivated the implementation of this so far (is it not that big a
> benefit on important targets, too much time/effort for the payoff)?
There is a huge payoff to doing this or even making progress on it.
It is an elegant solution to many issues. There are two current
problems:
1) There are some scalability issues (in compile time) with large
selectiondags. Roman has been tackling this recently, so this will
hopefully be a non-issue really soon. In any case, it would be very
reasonable to continue doing single-MBB selectiondags when in "-fast"
mode (i.e. -O0) and do whole function ones when compile time is worth
it (-O2+ etc).
2) The bigger issue is one of scheduling. The great thing about whole
function isel is that nothing up to scheduling needs to care about
what block something is in (as long as chain nodes pin side-effecting
instrs to their block of course). Things like machine-code loop
invariant code motion, CSE of values from different blocks, sinking
and many other details are easily handled, and directly in the
selection dags: this is really cool! :) The problem is that
*something* has to decide what block to put all the operations in. The
most logical thing to do here is to have a pass, either part of sched
or right before it, that assigns a block to each node (which is a
target instr node) and then have sched remain a single-bb at a time
(for now, it could obviously be improved later to work on traces etc).
The problem is that in full generality, this is a problem very similar
to PRE: assigning blocks to operations optimally is complex and
somewhat subtle. An excellent place to start looking at this is Cliff
Click's thesis, which is available online (and incidentally inspired a
lot of selection dag design). In his thesis, he has a hacky solution
that is probably good enough, certainly to start out with.
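For concreteness, here is a hedged C++ sketch of the kind of placement heuristic Click describes (the types and names are invented, not LLVM code): schedule-early and schedule-late give each node a legal range along a dominator-tree path, and you pick the block on that path with the shallowest loop nesting.

```cpp
// Hypothetical sketch of block placement a la Click's thesis.
// 'idom' is the immediate dominator; loopDepth is loop nesting depth.
struct Block {
  Block *idom = nullptr;
  unsigned loopDepth = 0;
};

// 'early' must dominate 'late'.  Walk the dominator path from late up
// to early and keep the shallowest block, preferring the later block
// on ties (so values stay close to their uses).
Block *pickBlock(Block *early, Block *late) {
  Block *best = late;
  for (Block *b = late; ; b = b->idom) {
    if (b->loopDepth < best->loopDepth)
      best = b;
    if (b == early)
      break;
  }
  return best;
}
```

This is a sketch of the heuristic only, not of how it would plug into sched; the real pass would compute 'early' from a node's operands and 'late' from the LCA of its uses.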
The reason this hasn't been done so far is that we (people at apple)
haven't been able to justify the "big project" of tackling this when
there are still tons of little projects that are lower-hanging fruit
for our
needs. We will almost certainly do this someday (and would have
already done it if it were blocking some project of ours), but would
love to have someone work on it in the shorter term. The thing that
Evan and I were discussing today is that there are a variety of ways
to get intermediate results without tackling the full generality of
the problem:
1) One annoying problem for targets like PPC and ARM (which promote i8/
i16 to i32) can be solved with a really simple hack. Consider a
switch on i16 that codegens to a branch tree. This turns into a bunch
of small blocks: the first typically zext's the value and does an
unsigned comparison against the whole range, and each subsequent block
does an unsigned comparison against its subportion of the range. The
problem is that right now, the subsequent comparisons just see that
the i16 value is promoted, which means the top bits are unknown. This
means that each block ends up doing tons of redundant "and"s to clear
out the top bits before their comparison. This is obviously a special
case of a general problem: uses of values in blocks that aren't the
def can't see properties of the value.
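To make the shape concrete, here is a minimal C++ example of the kind of switch that hits this (the function is illustrative, not from any real code): on a target that promotes i16 to i32, a branch-tree lowering of this switch puts each comparison in its own block, and without cross-block known-bits info each block must re-"and" the promoted value with 0xFFFF even though the zext at the top already cleared those bits.

```cpp
// Illustrative only: a switch on a promoted type (i16 on a 32-bit
// target).  Lowered as a tree of unsigned comparisons, each comparison
// lands in its own small block; every block sees only the promoted i32
// value, so the top 16 bits look "unknown" and get masked redundantly.
int classify(unsigned short x) {
  switch (x) {
  case 1:   return 10;
  case 2:   return 20;
  case 100: return 30;
  case 200: return 40;
  default:  return 0;
  }
}
```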
This (and similar problems) boils down to not knowing info about a
cross-block promoted value. This sort of thing can be solved with a
very simple hack by adding a few maps to SDISel. The idea is that, in
ISel, each cross-block CopyToReg for a Vreg can call ComputeMaskedBits
and ComputeSignBits and record the known live out bit+sign information
for the vreg. Later, if someone calls ComputeMaskedBits or
ComputeSignBits and it recurses up to a CopyFromReg from the vreg, the
information can be provided from its definition.
This elegantly and cheaply solves the problem. The reason I call it a
hack is that it means you get different code for a function based on
the order that blocks are selected (i.e. it is better to compile the
defs of values before the uses), which is annoying, but actually
probably just fine in practice. Codegen remains deterministic of
course.
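A hedged sketch of the maps in question (the class and method names here are invented, not the actual SDISel API): the def block records what it knows about a vreg's bits at CopyToReg time, and later blocks query that when ComputeMaskedBits recurses up to a CopyFromReg.

```cpp
#include <cstdint>
#include <map>

// Simplified stand-in for the known-bits result ComputeMaskedBits
// produces: which bits are known zero and which are known one.
struct KnownBits {
  uint64_t Zero = 0;
  uint64_t One  = 0;
};

// Hypothetical side table added to SDISel.
class CrossBlockInfo {
  std::map<unsigned, KnownBits> LiveOutInfo;  // vreg -> known bits
public:
  // Called at each cross-block CopyToReg, in the block defining VReg.
  void recordLiveOut(unsigned VReg, KnownBits KB) { LiveOutInfo[VReg] = KB; }

  // Called when a known-bits query recurses to a CopyFromReg of VReg.
  // If the def's block hasn't been selected yet, we conservatively
  // report nothing known -- which is exactly why selection order
  // affects (but never breaks) the generated code.
  KnownBits query(unsigned VReg) const {
    auto It = LiveOutInfo.find(VReg);
    return It == LiveOutInfo.end() ? KnownBits{} : It->second;
  }
};
```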
2) An intermediate step to whole function isel is to do region based
isel with regions limited by what makes the sched problem simple. One
very interesting class of region is a Single-Entry-Multiple-Exit
region. It would be straightforward to break each function into SEME
regions and select them one at a time, allowing them to compile to
multiple MBBs.
In addition to the obvious dag combiner and isel improvements that
would come from larger regions, SEME regions encapsulate a number of
interesting things (such as switch lowering). Having SEME regions
would allow us to remove the block juggling in SDISel: switches would
just lower to multiple blocks in one dag. SEME regions are also
totally trivial for the sched problem: because your region is just a
tree, all you have to do is move a computation "up" the tree towards
the root to the place that dominates all its uses. Another
interesting thing about SEME regions is that you don't have to worry
about phi nodes etc.
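Because a SEME region's blocks form a tree, "the place that dominates all its uses" is just the lowest common ancestor of the using blocks. A small sketch of that hoisting step, with invented types (not LLVM code):

```cpp
#include <vector>

// Blocks of a SEME region form a tree rooted at the region entry, so
// each block just needs a parent pointer and its depth in the tree.
struct RBlock {
  RBlock *parent = nullptr;
  unsigned depth = 0;
};

// The block a computation should be hoisted to: the lowest common
// ancestor of all blocks that use it, found by lifting the deeper
// block until the two walks meet.
RBlock *dominatingBlock(std::vector<RBlock *> uses) {
  RBlock *a = uses.front();
  for (RBlock *b : uses) {
    while (a->depth > b->depth) a = a->parent;
    while (b->depth > a->depth) b = b->parent;
    while (a != b) { a = a->parent; b = b->parent; }
  }
  return a;
}
```

The real sched pass would of course operate on SDNodes rather than a toy tree, but the placement question for SEME regions really is this simple.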
OTOH, SEME regions are still limiting, so we would eventually want to
do whole function (or at least "arbitrary region") isel. SEME regions
do handle "sinking" well, but don't handle LICM (because you can't
have a whole loop in the region) and don't handle the diamond pattern
that "select" lowers to when a target doesn't support cmov. Wouldn't
it be nice if legalize could lower select directly for targets (by
inserting blocks into the DAG), instead of having to do the "custom
sched inserter" hack? :) Wouldn't it be nice if dag combine could
actually itself form cmovs from diamond shapes in the cfg? :)
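For readers unfamiliar with the pattern: the diamond a cmov-less target gets for "select" looks like this when written back as source-level C++ (illustrative only):

```cpp
// select(cond, tval, fval) on a target without cmov: control flow
// splits on the condition and rejoins at a phi -- a two-armed diamond.
int selectLowered(bool cond, int tval, int fval) {
  int result;       // the phi's storage at the join
  if (cond)         // branch block
    result = tval;  // true arm
  else
    result = fval;  // false arm
  return result;    // join block: phi(true arm, false arm)
}
```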
At the end of the day, attacking a subclass of the "big problem" is a
really interesting way (to me) to incrementally tackle this problem,
and I strongly encourage anyone interested in this to start with SEME
regions first. This allows us to figure out things like "how do we
represent the start of a bb" (probably a new chained node), "how does
the sched pass set the block that an instruction goes in" (probably a
new field in SDNode), and allows us to do some cleanups (e.g. switch
lowering) without tackling some of the other hard problems. Once SEME
regions are working really well, we can tackle the rest of the hard
problems, because they are obviously worthwhile doing eventually.
> In toying with some code generators I've been frustrated in getting
> the code gen prepare pass + isel to implement the heuristics I need
> to match some addressing modes and I believe that whole-function
> isel might be the answer.
I would love to see any progress in this area. It is clearly an
important thing to tackle, and it is blocking other interesting
improvements in the code generator. It would also allow us to
eliminate a significant amount of weirdness that exists to hack around
this (e.g. switch lowering). That said, even when we *can* do whole
function isel, I think we should keep the ability to handle subregions
of the function for -O0, at least until SD stuff scales perfectly :).
-Chris