thr3ads.net - llvm dev - [llvm-dev] [RFC] design doc for straight-line scalar optimizations [Aug 2015]

If this information is useful, please help other people find it:
Share via:

Jingyue Wu via llvm-dev

2015-Aug-25 01:52 UTC

[llvm-dev] [RFC] design doc for straight-line scalar optimizations

Hi Escha,

We certainly would love to generalize them as long as the performance
doesn't suffer in general. If you have specific use cases that are
regressed due to these optimizations, I am more than happy to take a look.

On Mon, Aug 24, 2015 at 6:43 PM, escha <escha at apple.com> wrote:
>
> On Aug 24, 2015, at 11:10 AM, Jingyue Wu via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Hi,
>
> As you may have noticed, since last year, we (Google's CUDA compiler
team)
> have contributed quite a lot to the effort of optimizing LLVM for CUDA
> programs. I think it's worthwhile to write some docs to wrap them up
for
> two reasons.
> 1) Whoever wants to understand or work on these optimizations has some
> detailed docs instead of just source code to refer to.
> 2) RFC on how to improve these optimizations so that other targets can
> benefit from them as well. They are currently mostly restricted to the
> NVPTX backend, but I see many potentials to generalize them.
>
> So, I started from this overdue design doc
>
<https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.google.com_document_d_1momWzKFf4D6h8H3YlfgKQ3qeZy5ayvMRh6yR-2DXn2hUE_edit-3Fusp-3Dsharing&d=BQMFaQ&c=eEvniauFctOgLOKGJOplqw&r=szS1_DDBoKCtS8B5df7mJg&m=TggebUNOWYFU5W3tKpC_z1CkNT9MN05aBwWloSru2NI&s=vmPxp-RDJuf_ZN5X7LNlV10JwuHK5Pt1ljn96IenW-o&e=>
on the
> straight-line scalar optimizations. I will send out more docs on other
> optimizations later. Please feel free to comment.
>
> Thanks,
> Jingyue
>
>
> Out of curiosity, is there any plan to make the NVPTX-originated passes
> (separateconstantoffsetfromgep, slsr, naryreassociate) more generic? They
> seem very specialized for the nVidia GPU addressing modes despite the
> generic names, and in my tests tend to pessimize our target more often than
> not for that reason.
>
> It’d be really nice to have something more generic, and I might look into
> helping with that sort of thing in the future if it becomes important for
> us.
>
> —escha
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150824/3c6ee3d0/attachment.html>

escha via llvm-dev

2015-Aug-25 02:05 UTC

head link

[llvm-dev] [RFC] design doc for straight-line scalar optimizations

It’s not a big issue since they’re opt-in passes, naturally, but if we did want
to use them, the main problem is that our addressing modes are of the form:

X + sext(Y) * scale

for i64 X and i32 Y and various (power of 2) scales.

A lot of these passes tend to move around the sexts in ways that prevent them
from being folded into addressing modes. Another thing I’ve noticed (in
separateGEPFromConstantOffset) ends up turning (X + sext(Y + C)) into (X +
sext(Y)) + C. Unfortunately, unless this saves operations due to CSE, it ends up
with us having:

tmp = X + sext(Y)
load with base = X and offset = C

instead of

tmp = Y + C
load with base = X and offset = tmp

Since 64-bit adds are more expensive than 32-bit adds on our GPU, this ends up
being a pessimization.

These are just a few examples I spotted; it’s probably a more general LLVM
problem as a whole that passes which manipulate GEPs and induction variables
tend to be oblivious of addressing modes, or tuned towards a particular sort of
addressing mode.

—escha
> On Aug 24, 2015, at 6:52 PM, Jingyue Wu <jingyue at google.com>
wrote:
> 
> Hi Escha, 
> 
> We certainly would love to generalize them as long as the performance
doesn't suffer in general. If you have specific use cases that are regressed
due to these optimizations, I am more than happy to take a look.
> 
> On Mon, Aug 24, 2015 at 6:43 PM, escha <escha at apple.com
<mailto:escha at apple.com>> wrote:
> 
>> On Aug 24, 2015, at 11:10 AM, Jingyue Wu via llvm-dev <llvm-dev at
lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>> 
>> Hi, 
>> 
>> As you may have noticed, since last year, we (Google's CUDA
compiler team) have contributed quite a lot to the effort of optimizing LLVM for
CUDA programs. I think it's worthwhile to write some docs to wrap them up
for two reasons.
>> 1) Whoever wants to understand or work on these optimizations has some
detailed docs instead of just source code to refer to.
>> 2) RFC on how to improve these optimizations so that other targets can
benefit from them as well. They are currently mostly restricted to the NVPTX
backend, but I see many potentials to generalize them.
>> 
>> So, I started from this overdue design doc
<https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.google.com_document_d_1momWzKFf4D6h8H3YlfgKQ3qeZy5ayvMRh6yR-2DXn2hUE_edit-3Fusp-3Dsharing&d=BQMFaQ&c=eEvniauFctOgLOKGJOplqw&r=szS1_DDBoKCtS8B5df7mJg&m=TggebUNOWYFU5W3tKpC_z1CkNT9MN05aBwWloSru2NI&s=vmPxp-RDJuf_ZN5X7LNlV10JwuHK5Pt1ljn96IenW-o&e=>
on the straight-line scalar optimizations. I will send out more docs on other
optimizations later. Please feel free to comment.
>> 
>> Thanks, 
>> Jingyue
> 
> Out of curiosity, is there any plan to make the NVPTX-originated passes
(separateconstantoffsetfromgep, slsr, naryreassociate) more generic? They seem
very specialized for the nVidia GPU addressing modes despite the generic names,
and in my tests tend to pessimize our target more often than not for that
reason.
> 
> It’d be really nice to have something more generic, and I might look into
helping with that sort of thing in the future if it becomes important for us.
> 
> —escha
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150824/48e1c7fd/attachment.html>

Jingyue Wu via llvm-dev

2015-Aug-25 16:46 UTC

head link

[llvm-dev] [RFC] design doc for straight-line scalar optimizations

That's a very interesting case. I think add more addressing mode checks to
PAR (SeparateConstOffset) may work for your cases. Something like if
"a[i]"
can be folded, don't perform this optimization at all. The CSE benefit can
often be achieved by global reassociation (though it needs to be extended
to handle more binary operators).

On Mon, Aug 24, 2015 at 7:05 PM, escha <escha at apple.com> wrote:
> It’s not a big issue since they’re opt-in passes, naturally, but if we did
> want to use them, the main problem is that our addressing modes are of the
> form:
>
> X + sext(Y) * scale
>
> for i64 X and i32 Y and various (power of 2) scales.
>
> A lot of these passes tend to move around the sexts in ways that prevent
> them from being folded into addressing modes. Another thing I’ve noticed
> (in separateGEPFromConstantOffset) ends up turning (X + sext(Y + C)) into
> (X + sext(Y)) + C. Unfortunately, unless this saves operations due to CSE,
> it ends up with us having:
>
> tmp = X + sext(Y)
> load with base = X and offset = C
>
> instead of
>
> tmp = Y + C
> load with base = X and offset = tmp
>
> Since 64-bit adds are more expensive than 32-bit adds on our GPU, this
> ends up being a pessimization.
>
> These are just a few examples I spotted; it’s probably a more general LLVM
> problem as a whole that passes which manipulate GEPs and induction
> variables tend to be oblivious of addressing modes, or tuned towards a
> particular sort of addressing mode.
>
These limitations come naturally from the fact such passes are initially
added to benefit certain architectures. However, I can't think of any
reasons why we shouldn't fix such issues with the help of TTI.

>
> —escha
>
> On Aug 24, 2015, at 6:52 PM, Jingyue Wu <jingyue at google.com>
wrote:
>
> Hi Escha,
>
> We certainly would love to generalize them as long as the performance
> doesn't suffer in general. If you have specific use cases that are
> regressed due to these optimizations, I am more than happy to take a look.
>
> On Mon, Aug 24, 2015 at 6:43 PM, escha <escha at apple.com> wrote:
>
>>
>> On Aug 24, 2015, at 11:10 AM, Jingyue Wu via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> Hi,
>>
>> As you may have noticed, since last year, we (Google's CUDA
compiler
>> team) have contributed quite a lot to the effort of optimizing LLVM for
>> CUDA programs. I think it's worthwhile to write some docs to wrap
them up
>> for two reasons.
>> 1) Whoever wants to understand or work on these optimizations has some
>> detailed docs instead of just source code to refer to.
>> 2) RFC on how to improve these optimizations so that other targets can
>> benefit from them as well. They are currently mostly restricted to the
>> NVPTX backend, but I see many potentials to generalize them.
>>
>> So, I started from this overdue design doc
>>
<https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.google.com_document_d_1momWzKFf4D6h8H3YlfgKQ3qeZy5ayvMRh6yR-2DXn2hUE_edit-3Fusp-3Dsharing&d=BQMFaQ&c=eEvniauFctOgLOKGJOplqw&r=szS1_DDBoKCtS8B5df7mJg&m=TggebUNOWYFU5W3tKpC_z1CkNT9MN05aBwWloSru2NI&s=vmPxp-RDJuf_ZN5X7LNlV10JwuHK5Pt1ljn96IenW-o&e=>
on the
>> straight-line scalar optimizations. I will send out more docs on other
>> optimizations later. Please feel free to comment.
>>
>> Thanks,
>> Jingyue
>>
>>
>> Out of curiosity, is there any plan to make the NVPTX-originated passes
>> (separateconstantoffsetfromgep, slsr, naryreassociate) more generic?
They
>> seem very specialized for the nVidia GPU addressing modes despite the
>> generic names, and in my tests tend to pessimize our target more often
than
>> not for that reason.
>>
>> It’d be really nice to have something more generic, and I might look
into
>> helping with that sort of thing in the future if it becomes important
for
>> us.
>>
>> —escha
>>
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150825/6a2dd889/attachment.html>

David Chisnall via llvm-dev

2015-Aug-25 17:32 UTC

head link

[llvm-dev] [RFC] design doc for straight-line scalar optimizations

On 25 Aug 2015, at 03:05, escha via llvm-dev <llvm-dev at lists.llvm.org>
wrote:> 
> These are just a few examples I spotted; it’s probably a more general LLVM
problem as a whole that passes which manipulate GEPs and induction variables
tend to be oblivious of addressing modes, or tuned towards a particular sort of
addressing mode.
The canonical form for a loop in LLVM IR is to have a single induction variable
and derive all other variables from it.  This is good for mid-level optimisers,
but often ends up generating quite sub-optimal code on at least some targets. 
For example, turning an add-immediate of another variable in each loop iteration
into a multiply (one instruction, multi-cycle latency) of the loop induction
variable a constant (which takes one instruction to materialise on MIPS, as
there’s no multiply-by-immediate instruction).

If we had some documentation of precisely *what* the canonical forms are for
each phase in the optimisation pipeline, then it would be easier to understand
these effects and know when we need some de-canonicalise pass before codegen.

David

llvm dev - Aug 2015 - [RFC] design doc for straight-line scalar optimizations

[llvm-dev] [RFC] design doc for straight-line scalar optimizations

[llvm-dev] [RFC] design doc for straight-line scalar optimizations

[llvm-dev] [RFC] design doc for straight-line scalar optimizations

[llvm-dev] [RFC] design doc for straight-line scalar optimizations