thr3ads.net - llvm dev - [LLVMdev] Plan to optimize atomics in LLVM [Aug 2014]


Robin Morisset

2014-Aug-05 22:46 UTC

[LLVMdev] Plan to optimize atomics in LLVM

Hello everyone,

I have recently started on optimizing C11/C++11 atomics in LLVM, and plan
to focus on that for the next two months as an intern in the PNaCl team.
I’ve sent two patches on this topic to Phabricator that fix
http://llvm.org/bugs/show_bug.cgi?id=17281:

http://reviews.llvm.org/D4796

http://reviews.llvm.org/D4797


The first patch is X86-specific, and tries to apply operations with
immediates to atomics without going through a register. The main trouble
here is that the X86 backend appears to respect the LLVM memory model instead
of the x86-TSO memory model, and may reorder instructions. In order to
prevent illegal reordering of atomic accesses, the backend converts atomic
accesses to pseudo-instructions in X86InstrCompiler.td (RELEASE_MOV* and
ACQUIRE_MOV*) that are opaque to most of the rest of the backend, and only
lowers those at the very end of the pipeline. I have decided to follow the
same approach, just adding some more RELEASE_* pseudo-instructions rather
than trying to find every possibly misbehaving part of the backend in order
to do early lowering. This lowers the risk and complexity of the patch, at
the cost of possibly missing some optimization opportunities.
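
For readers who want a concrete picture, here is a minimal sketch of the
kind of source pattern I have in mind. The helper names (bump, bump_n) are
made up for illustration, and the instruction sequences in the comments are
only the rough shape of the code, not actual compiler output:

```cpp
#include <atomic>
#include <cassert>

// Single-writer increment (hypothetical helper): a relaxed load, an add
// with an immediate, and a release store back to the same location.
// This is not an atomic read-modify-write, so it is only correct when a
// single thread writes `c`; in exchange it needs no lock prefix on x86.
void bump(std::atomic<int>& c) {
    int old = c.load(std::memory_order_relaxed);
    // Through a register (rough shape): mov (mem), %reg ; add $1, %reg ;
    //                                   mov %reg, (mem)
    // What the patch aims for:          add $1, (mem)
    c.store(old + 1, std::memory_order_release);
}

// Applies bump() n times to a fresh counter and returns the final value.
int bump_n(int n) {
    assert(n >= 0);
    std::atomic<int> c{0};
    for (int i = 0; i < n; ++i) bump(c);
    return c.load(std::memory_order_acquire);
}
```

Calling bump_n(3) returns 3; the interesting part is only how the store in
bump() gets lowered.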

Another trouble I had with this patch is a failure of TableGen type
inference when adding negate/not to the scheme. As a result I have left
these two instructions commented out in this patch. Does anyone have an
idea for how to proceed with this?

The second patch is more straightforward: in the C11/C++11 memory model
(that LLVM basically borrows), optimizations like DSE can safely fire
across atomic accesses in many cases, basically as long as they are not
operating across a release-acquire pair (see paper referenced in the
comments). So I tweaked MemoryDependenceAnalysis to track such pairs and
only return a clobber result in such cases.
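
A small sketch of the distinction, as I understand it. The function and
variable names are hypothetical, and the reasoning in the comments is my
reading of the C11/C++11 model rather than a description of what the patch
does line by line:

```cpp
#include <atomic>
#include <cassert>

// Case 1: only a release store sits between the two plain stores.
// A thread that acquire-loads `flag` and then reads `data` would race
// with `data = 2`, so no race-free program can observe `data == 1`.
// The first store is a candidate for dead store elimination.
int no_pair(int& data, std::atomic<int>& flag) {
    data = 1;
    flag.store(1, std::memory_order_release);
    data = 2;
    return data;
}

// Case 2: a release store followed by an acquire load.
// An observer could acquire-load `flag`, read `data` (seeing 1), and then
// release-store `ack`. That read is ordered before `data = 2`, so the
// first store is observable and has to stay.
int with_pair(int& data, std::atomic<int>& flag, std::atomic<int>& ack) {
    data = 1;
    flag.store(1, std::memory_order_release);
    while (ack.load(std::memory_order_acquire) == 0) {}
    data = 2;
    return data;
}

// Single-threaded smoke check: `ack` is pre-set so the wait loop exits.
int demo() {
    int data = 0;
    std::atomic<int> flag{0}, ack{1};
    int a = no_pair(data, flag);
    int b = with_pair(data, flag, ack);
    assert(a == 2 && b == 2);
    return a + b;
}
```

Run single-threaded, both functions behave the same; the difference is only
in what another thread is allowed to observe, which is what decides whether
the first store may be removed.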

My next steps will probably be to improve the compilation of acquire
atomics in the ARM backend. In particular, they are currently compiled by a
load + dmb, while a load + dependent useless branch + isb is also valid
(see http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for example) and
may be faster. Even better: if there is already a dependent branch (such as
the loop for the lowering of CAS), it is just a cheap isb. The main step
will be switching off the InsertFencesForAtomic flag and doing the lowering
of atomics in the backend, because once an acquire load has been
transformed into an acquire fence, too much information has been lost to
apply this mapping.
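
To illustrate, here is a sketch of an acquire load that already feeds a
branch. The function names are hypothetical, and the instruction sequences
in the comments are only the two mappings described above, not the output
of any particular compiler:

```cpp
#include <atomic>
#include <cassert>

// Spins until `ready` is set, then reads the payload published before it.
// The loop branch depends on the value returned by the acquire load.
//
// Mapping used today (rough shape):   ldr ; dmb
// Alternative mapping (rough shape):  ldr ; branch depending on the
//                                     loaded value ; isb
// Here the dependent branch already exists (the loop test), so the
// alternative would only have to add the isb.
int read_when_ready(const std::atomic<int>& ready, const int& payload) {
    while (ready.load(std::memory_order_acquire) == 0) {}
    return payload;
}

// Single-threaded usage: publish first, then read.
int demo_read() {
    int payload = 42;
    std::atomic<int> ready{0};
    ready.store(1, std::memory_order_release);
    int v = read_when_ready(ready, payload);
    assert(v == 42);
    return v;
}
```

Once the acquire load has been split into a plain load and a separate
fence, the backend can no longer tell that the fence belongs to this
particular load, which is why the lowering has to move.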

Longer term, I hope to improve the fence elimination of the ARM backend
with a kind of PRE algorithm. Both of these improvements to the ARM backend
should be fairly straightforward to port to the POWER architecture later,
and I hope to also do that.

Does this approach seem worthwhile to you? Can I do anything to help the
review process?

Thank you,

Robin Morisset

Philip Reames

2014-Aug-07 23:01 UTC

[LLVMdev] Plan to optimize atomics in LLVM

On 08/05/2014 03:46 PM, Robin Morisset wrote:
> Hello everyone,
>
>
> I have recently started on optimizing C11/C++11 atomics in LLVM, and 
> plan to focus on that for the next two months as an intern in the 
> PNaCl team. I’ve sent two patches on this topic to Phabricator that 
> fix http://llvm.org/bugs/show_bug.cgi?id=17281:
>
> http://reviews.llvm.org/D4796
>
> http://reviews.llvm.org/D4797
>

First, welcome!  This is definitely an area which needs some attention.
As you've already seen, I'll be happy to help review some of these
patches.  We will need to find other qualified reviewers though.  As
you've already seen, some of these issues are *extremely* subtle.

Fair warning, given how challenging this is to get right, and how hard
it is to debug when wrong, you should expect very slow progress and
quite a bit of time required for reviews.  Split your reviews into the
smallest possible pieces you can to give reviewers a chance at
understanding them.

>
> The first patch is X86-specific, and tries to apply operations with 
> immediates to atomics without going through a register. The main 
> trouble here is that the X86 backend appears to respect LLVM memory 
> model instead of the x86-TSO memory model, and may reorder 
> instructions. In order to prevent illegal reordering of atomic 
> accesses, the backend converts atomic accesses to pseudo-instructions 
> in X86InstrCompiler.td (RELEASE_MOV* and ACQUIRE_MOV*) that are opaque 
> to most of the rest of the backend, and only lowers those at the very 
> end of the pipeline. I have decided to follow the same approach, just 
> adding some more RELEASE_* pseudo-instructions rather than trying to 
> find every possibly misbehaving part of the backend in order to do 
> early lowering. This lowers the risk and complexity of the patch, but 
> at the cost of possibly missing some optimization possibilities.
>
> Another trouble I had with this patch is a failure of TableGen type 
> inference when adding negate/not to the scheme. As a result I have 
> left these two instructions commented out in this patch. Does anyone 
> have an idea for how to proceed with this ?
>
>
> The second patch is more straightforward: in the C11/C++11 memory 
> model (that LLVM basically borrows), optimizations like DSE can safely 
> fire across atomic accesses in many cases, basically as long as they 
> are not operating across a release-acquire pair (see paper referenced 
> in the comments). So I tweaked MemoryDependenceAnalysis to track such 
> pairs and only return a clobber result in such cases.
>

I've already made comments on these in the respective review threads.

Other readers - Please, we need extra reviewers on these!  If you have
experience reasoning about memory model issues, please chime in.

> My next steps will probably be to improve the compilation of acquire 
> atomics in the ARM backend. In particular, they are currently compiled 
> by a load + dmb, while a load + dependant useless branch + isb is also 
> valid (see http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for example)
> and may be faster. Even better: if there is already a dependant branch 
> (such as the loop for the lowering of CAS), it is just a cheap isb. 
> The main step will be switching off the InsertFencesForAtomic flag, 
> and do the lowering of atomics in the backend, because once an acquire 
> load has been transformed in an acquire fence, too much information 
> has been lost to apply this mapping.
>

I'm unfamiliar with the details of the ARM architecture, so I won't be
able to help you much here.

> Longer term, I hope to improve the fence elimination of the ARM 
> backend with a kind of PRE algorithm. Both of these improvements to 
> the ARM backend should be fairly straightforward to port to the POWER 
> architecture later, and I hope to also do that.
>

Any reason these couldn't be done at the IR level?

> Does this approach seem worthwhile to you ? Can I do anything to help 
> the review process ?
>

Small patches.  Lots of justification.  Many test cases.  Ping folks
periodically.

Philip

Tim Northover

2014-Aug-08 06:34 UTC

[LLVMdev] Plan to optimize atomics in LLVM

>> Longer term, I hope to improve the fence elimination of the ARM backend
>> with a kind of PRE algorithm. Both of these improvements to the ARM backend
>> should be fairly straightforward to port to the POWER architecture later,
>> and I hope to also do that.
>
> Any reason these couldn't be done at the IR level?

I definitely agree here. At the time, it was a plausible idea (the
barriers didn't even exist in IR most of the time). But the
implementation was always going to be much more complicated and less
portable than in IR, and what we actually have is very flawed in its
own right (only applies to ARM mode, unmaintained,

Actually, I think we decided to remove it a while back, but I haven't
gotten around to it yet.

Cheers.

Tim.
