Hello everyone,
I have been thinking about implementing a fence elimination algorithm based
on PRE over the last few weeks. It will operate on target-dependent fences
(as there are very few target-independent fence elimination opportunities),
but should be general enough to be used by both the ARM and the Power
backends just by tweaking some simple hooks. In this email I will try to
summarize my ideas about it, to get some feedback before I start
implementing. If you have any question/suggestion/comment/criticism, please
say so!
*TL;DR:* I have a handful of questions:
- Should I try to make the code general enough to be used for full PRE
later, or only care about fences?
- Is it ok to use metadata to preserve information that would otherwise be
lost by a transformation pass (and would help a later pass)?
- If you have read the text below, does the approach make sense?
Thank you,
Robin Morisset
Fence elimination algorithm

Background, link to Partial Redundancy Elimination
There is already a local fence elimination algorithm for ARM
(lib/Target/ARM/ARMOptimizeBarriersPass.cpp) that tries to find pairs of
dmbs with no memory access in between and eliminate one of the two (see the
report describing the idea/results
<http://llvm.org/devmtg/2014-04/PDFs/Talks/Reinoud-report.pdf>). The goal
here is to do better, in particular to hoist fences out of loops and to
optimize across basic blocks.
This is closely related to PRE
<http://en.wikipedia.org/wiki/Partial_redundancy_elimination>: we want a
dmb on every path between some accesses and some other accesses, just as
PRE wants a computation on every path between the definitions of its
operands and the uses of its result. Sadly, only an extremely limited form
of PRE is done in LLVM, inside a sub-phase of the GVN pass, so we cannot
reuse existing infrastructure.
Most modern papers on PRE see it as a flow problem: if all the operands are
connected to a source and all the uses of the computation are connected to
a sink, then the goal is to find a cut of the graph between the source and
the sink that satisfies an additional property.
For classical PRE (i.e. conservative, static), the extra property is that
no path through the CFG goes through more instances of the computation than
before. For speculative PRE (SPRE), it is that the cut is a min-cut, when
all edges in the CFG are annotated with their frequency (which requires
profiling-based information in general). The same distinction can be made
in our case: using profiling-based information allows more aggressive
optimization but with a risk of mis-optimization if the information is
wrong.
Different levels of fences
Quite often there will be different kinds of fences to deal with, such as
dmb ish/dmb ishst on Swift ARM processors, or hwsync/lwsync on Power. I
suggest a rather greedy approach: first insert the strongest fences without
looking at the weaker ones, then insert the weaker ones while taking into
account the stronger ones. If two fences are incomparable, insert each one
independently, completely ignoring the other. The cost of this approach is
having to do a cut for each kind of fence; however, I don't expect the
graphs under consideration to be big very often.
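To make this ordering concrete, here is a purely hypothetical sketch of the
driver loop it implies; FenceKind, CutGraph, buildGraphForKind and
insertFencesOnMinCut are placeholder names for illustration only (a rough
sketch of the graph structure appears in the next section), not existing
LLVM APIs:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/Function.h"

// Illustrative only: process fence kinds from strongest to weakest, so
// that weaker kinds can take the already-placed stronger fences into
// account.
enum class FenceKind { HwSync, LwSync }; // e.g. Power; ARM would have its own

void eliminateFences(llvm::Function &F,
                     llvm::ArrayRef<FenceKind> KindsStrongestFirst) {
  for (FenceKind Kind : KindsStrongestFirst) {
    CutGraph G = buildGraphForKind(F, Kind); // hypothetical, sketched below
    insertFencesOnMinCut(G, Kind);           // hypothetical cut + insertion
  }
}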
Building the graph
The graph on which we will compute the cut is based on the CFG, but there
are several possible ways to tweak it, mostly to accelerate the algorithm:
- By default, there would be two nodes for each instruction (beginning and
end), with an infinite weight on them (it is impossible to insert a Fence
in the middle of an instruction), and an edge between the end of an
instruction and the beginning of the next. We cannot use just one node per
instruction, because the beginning of an instruction may be a sink, while
its end is a source. (A rough data-structure sketch follows this list.)
- We can merge together all the instructions in a basic block which are not
memory accesses (since fences commute with those anyway) and do not affect
control flow.
- We can put an infinite weight on edges between basic blocks (fences must
then stay inside BBs) or not (we must then do edge-splitting to insert
fences there).
- The cut algorithm cost is only incurred on code with fences, and any part
of the graph which is unreachable from the source or co-unreachable from
the sink can be pruned (or not even built).
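For concreteness, here is a minimal data-structure sketch of this graph
under the assumptions above (two nodes per interesting instruction,
weighted edges, infinite weights modeled as a sentinel). All names are
illustrative, not an existing LLVM API:

#include <cstdint>
#include <vector>

// One node for the beginning and one for the end of each interesting
// instruction, plus a dedicated source and sink node.
struct CutNode {
  const void *Instr; // the instruction this node stands for (opaque here)
  bool IsEnd;        // false = beginning of the instruction, true = end
};

struct CutEdge {
  unsigned From, To;
  uint64_t Capacity; // weight of cutting this edge
};

static const uint64_t Infinite = UINT64_MAX; // "a fence can never go here"

struct CutGraph {
  std::vector<CutNode> Nodes;
  std::vector<CutEdge> Edges;
  unsigned Source = 0, Sink = 1; // by convention the first two nodes
};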
Basic algorithm
The basic algorithm I propose is thus the following:
1. Gather all Fences.
2. For each Fence:
   1. On every path going down from the Fence, find the first memory
      access/return/call, and connect the beginning of this instruction to
      the sink (with infinite weight). If instead we first reach a stronger
      Fence, then do nothing for that path (do not connect it to the sink).
   2. On every path going up from the Fence, find the first memory
      access/call/beginning of function, and connect the source to its end
      (with infinite weight). If instead we first reach a stronger Fence,
      then do nothing for that path.
   3. Remove the Fence.
3. Execute the cut algorithm on this graph (only keeping the nodes/edges
   traversed above, merging all adjacent nodes that are not branches or
   memory accesses/calls/returns), and insert a Fence on every edge that is
   part of the cut.
It is fairly obvious that this algorithm preserves the property that every
pair of memory accesses that was separated by a fence still is (and by a
fence of the same kind, since we operate on only one kind of fence at a
time).
In terms of implementation, steps 1, 2.1 and 2.2 could be one analysis
pass, and steps 2.3 and 3 could be a transform pass depending on it. A
rough sketch of the per-fence search follows.
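Concretely, the downward search of step 2.1 could look roughly like this,
reusing the CutGraph sketch from the previous section; isStrongerFenceThan,
beginNodeOf and connectToSink are hypothetical helpers (connectToSink would
just add an infinite-weight edge from the instruction's begin node to the
sink), and a real implementation would need much more care than this:

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Instructions.h"

// Step 2.1: walk forward from the Fence; step 2.2 is symmetric, walking
// backward and connecting instruction ends to the source instead.
void searchDownFrom(llvm::Instruction *Fence, CutGraph &G) {
  llvm::SmallVector<llvm::Instruction *, 8> Worklist;
  llvm::SmallPtrSet<llvm::Instruction *, 16> Visited;
  Worklist.push_back(Fence->getNextNode());
  while (!Worklist.empty()) {
    llvm::Instruction *I = Worklist.pop_back_val();
    if (!I || !Visited.insert(I).second)
      continue;
    if (isStrongerFenceThan(I, Fence)) // hypothetical helper
      continue;                        // this path is already ordered
    if (I->mayReadOrWriteMemory() || llvm::isa<llvm::CallInst>(I) ||
        llvm::isa<llvm::ReturnInst>(I)) {
      connectToSink(G, beginNodeOf(G, I), Infinite); // hypothetical helpers
      continue; // stop this path at the first interesting instruction
    }
    if (llvm::Instruction *Next = I->getNextNode())
      Worklist.push_back(Next);
    else // end of the block: continue into every successor
      for (llvm::BasicBlock *Succ : llvm::successors(I->getParent()))
        Worklist.push_back(&Succ->front());
  }
}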
Improvements
A nice property of this framework is that it can easily be expanded to
allow more optimizations. Here are some examples that may be added
incrementally later.
Merging barriers coming from the same location
Accesses to the same location are automatically ordered on any reasonable
architecture, so we do not need fences to order them. So when looking
upwards from the fence introduced by a release, or downwards from the fence
introduced by an acquire, we can skip any access to the same location (with
the same size obviously). This may sound rare, but in practice it is
crucial to hoisting fences out of loops. Consider for example the following
code (which can be found any time a thread spins waiting on another one):
while (true)
if (x.load(acquire)) break;
Without this optimization, there must be a Fence after every load, whereas
after it there can be a single Fence after the loop. The only difficulty
here is finding what access (if any) a fence comes from; see the next part
of this doc for ideas on how to do that.
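To make the effect concrete, here is the loop above again (slightly
rewritten so the fence position is visible), with the dmb placement shown
as comments; this only illustrates where the fences end up, the pass itself
operates on IR/target fences, not on C++ source:

#include <atomic>

std::atomic<bool> x;

void spin_wait() {
  for (;;) {
    bool ready = x.load(std::memory_order_acquire);
    // without the optimization: a dmb ish is needed right here,
    // after every executed load
    if (ready) break;
  }
  // with it: a single dmb ish here, after the loop, is enough
}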
Making the optimization interprocedural
It is fairly simple to make the pass partially (greedily) interprocedural
for free: operate on the functions bottom-up, and whenever a Fence is
inserted just after the start of a function (resp. just before every
return), tag this function with some special piece of metadata. Then,
whenever the search downwards (resp. upwards) from a Fence hits a call to
such a function, stop the search without connecting to the sink (resp.
source).
This is not necessarily optimal (as we cannot push a Fence through a
function for example), but it is free and may only improve things (get rid
of some Fences).
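As a rough illustration of what the tagging could look like, assuming a
recent LLVM where functions (as GlobalObjects) can carry metadata; the
metadata kind name and both helpers are made up for this sketch:

#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Metadata.h"

// Mark a function whose entry is known to start with a Fence. The kind
// name "llvm.fence.at.entry" is purely illustrative.
static void markEntryFence(llvm::Function &F) {
  F.setMetadata("llvm.fence.at.entry", llvm::MDNode::get(F.getContext(), {}));
}

// During the downward search, a call to such a function ends the path
// without connecting anything to the sink.
static bool calleeHasEntryFence(const llvm::CallInst &CI) {
  const llvm::Function *Callee = CI.getCalledFunction();
  return Callee && Callee->getMetadata("llvm.fence.at.entry");
}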
Turning acquire into consume
Neither Power nor ARM can reorder two memory accesses if the first is a
read and the second has a data or address dependency on the first. This
means that when searching for memory accesses downward from a Fence
inserted by an acquire read, we can skip any access that depends on that
acquire read.
This would for example allow optimizing the following:
tmp = x.load(acquire);
tmp2 = (*tmp).load(acquire);
return tmp2;
By default this would require two Fences (one after each load). Because the
second load depends on the first, only the second Fence is required (in the
algorithm, the source would be connected to the ends of both loads, but
only the return would be connected to the sink).
In practice, this optimization might be extremely difficult to implement
because of phase-ordering issues: no later pass can be allowed to eliminate
dependencies, which requires running extremely late (in the backend), at
which point iterating over the control flow is painful. So I think this
optimization probably won’t be implemented (soon) despite its potential
benefits.
Cut algorithm
I have seen at least 4 broad categories of algorithms for computing such a
cut.
Encoded as a dataflow problem with bitvectors
This has two main benefits: it is extremely adaptable (for example it can
enforce the conservativeness requirement of classical PRE, or various other
constraints in the literature), and it can do up to 64 cuts in parallel
(highly useful for PRE, which must do a cut for each expression, but
useless for us). This is what is used in most papers on PRE (most of which
are about clever/unreadable ways of doing that encoding), but I think it
would only make sense for us if we want to reuse the infrastructure to
provide general-purpose PRE.
Boykov-Kolmogorov algorithm
Specially tuned for computer vision problems, whose graphs are perfectly
regular and planar with some special structure: probably useless for us
(although easy
to test as it is implemented in boost
<http://www.boost.org/doc/libs/1_56_0/libs/graph/doc/boykov_kolmogorov_max_flow.html>
).
Edmonds-Karp algorithm
Probably too slow, but easy to test (boost
<http://www.boost.org/doc/libs/1_56_0/libs/graph/doc/edmonds_karp_max_flow.html>
).
Push-relabel algorithm
Probably the best for our purpose (and easy to test: boost
<http://www.boost.org/doc/libs/1_56_0/libs/graph/doc/push_relabel_max_flow.html>).
However, it is not obvious to me how to enforce the extra constraint
required for conservative PRE, so if we use it we will need profiling
information (or any reasonable static approximation, maybe from
BlockFrequency
<https://github.com/llvm-mirror/llvm/blob/master/docs/BlockFrequencyTerminology.rst>
?).
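For whoever wants to experiment, here is a small self-contained toy (not
the fence graph, just a meaningless 4-node example) showing the setup that
boost's push_relabel_max_flow expects: every edge paired with a reverse
edge, plus capacity/residual-capacity/reverse property maps. The min-cut
itself would then be read off the residual graph:

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/push_relabel_max_flow.hpp>
#include <iostream>

int main() {
  using namespace boost;
  typedef adjacency_list_traits<vecS, vecS, directedS> Traits;
  typedef adjacency_list<
      vecS, vecS, directedS, no_property,
      property<edge_capacity_t, long,
        property<edge_residual_capacity_t, long,
          property<edge_reverse_t, Traits::edge_descriptor>>>> Graph;

  Graph g(4); // 0 = source, 3 = sink
  auto cap = get(edge_capacity, g);
  auto rev = get(edge_reverse, g);

  // Every edge needs a paired reverse edge of capacity 0.
  auto addEdge = [&](int u, int v, long c) {
    auto e = add_edge(u, v, g).first;
    auto er = add_edge(v, u, g).first;
    cap[e] = c; cap[er] = 0;
    rev[e] = er; rev[er] = e;
  };
  addEdge(0, 1, 4);
  addEdge(0, 2, 2);
  addEdge(1, 3, 3);
  addEdge(2, 3, 5);

  long flow = push_relabel_max_flow(g, 0, 3);
  std::cout << "max flow (= weight of the min cut): " << flow << "\n";
  // The saturated edges crossing from the source side of the residual
  // graph to the sink side form the cut, i.e. where fences would go.
  return 0;
}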
Pass Ordering
I am currently planning on running this pass just after AtomicExpandPass,
i.e. just before lowering to SelectionDAG. It could in theory be executed
later in the backend, but I would rather avoid doing a global, control-flow
related optimization on SelectionDAGs if at all possible. It cannot be run
earlier, because the target-specific fences do not exist yet.
Receiving information from AtomicExpandPass
As we saw in the previous section, several possible improvements to the
graph building require this pass to find where a given Fence comes from.
This is thus not mandatory and won't be in the first patch, but we should
have a plan for how to do it later (incrementally). I see mostly three
ways of doing this (I tried getting some discussion
<http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-August/076326.html> on
this, but without much success). For clarity, I am using DMB here for the
target-dependent fence.
Pass the information directly from AtomicExpandPass to FenceEliminationPass
This would be the easiest, but would utterly break encapsulation, force the
two passes to run together all the time, and generally not respect the
current PassManager-based infrastructure. I don’t think it is acceptable.
Use rich pseudo-instructions instead of DMBs
In this proposal, AtomicExpandPass does not insert DMBs, but various
pseudo-instructions such as FenceAfterAcquire(LoadItWasCreatedBy). This
would work, but it requires adding some pseudo-instructions that the
backend (or another pass) would have to learn how to lower, even at -O0.
Put metadata on Fences
This is a variant on the previous proposal: AtomicExpandPass keeps
introducing DMBs, but tags them with special metadata describing their
origin. FenceEliminationPass could then make use of this information as
long as it is available, whereas the backend and any other pass would just
automatically ignore it and correctly emit DMBs. Furthermore, if any other
pass introduces/copies DMBs, FenceElimination will automatically be
correctly conservative. This approach seems like the best, and is currently
what I am planning on doing. For the optimizations currently envisioned, a
single piece of metadata llvm.fence.from with one argument being the access
that caused the fence to be emitted would be enough.
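A rough sketch of what this could look like, assuming the dmb is
represented by some Instruction (a target intrinsic call or a fence); the
llvm.fence.from kind name is the one proposed above, the helper names are
illustrative:

#include "llvm/IR/Instruction.h"
#include "llvm/IR/Metadata.h"

// AtomicExpandPass side: tag a freshly emitted dmb with the access that
// caused it; other passes simply ignore the metadata and still emit a dmb.
static void tagFenceOrigin(llvm::Instruction *Dmb,
                           llvm::Instruction *OriginatingAccess) {
  llvm::Metadata *Ops[] = {llvm::ValueAsMetadata::get(OriginatingAccess)};
  Dmb->setMetadata("llvm.fence.from",
                   llvm::MDNode::get(Dmb->getContext(), Ops));
}

// FenceEliminationPass side: recover the origin if the metadata survived,
// and stay conservative (return null) otherwise.
static llvm::Value *getFenceOrigin(llvm::Instruction *Dmb) {
  if (llvm::MDNode *MD = Dmb->getMetadata("llvm.fence.from"))
    if (auto *VM =
            llvm::dyn_cast<llvm::ValueAsMetadata>(MD->getOperand(0).get()))
      return VM->getValue();
  return nullptr;
}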
Adapting the algorithm to a variety of architectures
x86
There is not much point in doing fence elimination on x86, because only
fence seq_cst is lowered to an mfence. In particular, atomic accesses never
introduce mfences: seq_cst stores are compiled to lock xchg, and everything
else is just a mov.
ARM
ARM will be my first target; everything in this document should apply to it
straightforwardly.
Power
The main difference between Power and ARM is that Power has both hwsync and
lwsync, whereas ARM only has DMB ISH (it has a few others, but none
matching lwsync exactly). As described previously, this is not a big
problem: we can just insert the hwsyncs first and the lwsyncs second,
taking the hwsyncs into account in step 2 of the algorithm.
Others
I do not know the other architectures well enough to propose a design for
them. However, the approach looks general enough that it may be used for
Sparc/Mips if someone with the required expertise is motivated to do it.