thr3ads.net - llvm dev - [llvm-dev] [VSXFMAMutate] OldFMAReg may be wrongly rewritten [Mar 2016]

If this information is useful, please help other people find it:
Share via:

Tim Shen via llvm-dev

2016-Mar-16 17:53 UTC

[llvm-dev] [VSXFMAMutate] OldFMAReg may be wrongly rewritten

I implemented a proof of concept of a new generic MachineFunction SSA pass.
The code is not readable and not efficient yet, but it shows interesting
results:

In fma.ll @test_FMSUB2 (return dummy(A * B + C, A * B - D)):
before:
        fmr 0, 1
        xsmaddadp 3, 0, 2
        xsmsubmdp 0, 2, 4
        fmr 1, 3
        fmr 2, 0
        bl dummy2

after:
        xsmsubadp 4, 1, 2
        xsmaddmdp 1, 2, 3
        fmr 2, 4
        bl dummy2

In fma-assoc.ll all 9 copies are eliminated, for example (A * B + C * D +
E):
before:
        xsmaddmdp 3, 4, 5
        xsmaddadp 3, 1, 2
        fmr 1, 3

after:
        xsmaddadp 5, 3, 4
        xsmaddmdp 1, 2, 5

In fma-ext.ll, the only copy is also eliminated.

The new pass in theory can support any kinds of instructions that may pick
one among multiple its operands as an in-out register. VSX FMA instructions
are just some examples.

The drawback is this pass requires LiveIntervals analysis at SSA form and
Machine MachineDominatorTree analysis. It definitely preserves the former,
but not sure about the latter - it rearranges operands and re-tie them.

If you think this result is good enough, I'll go ahead and do a bit cleanup
and code optimization. It's in theory as fast as a topological sort.

On Fri, Mar 4, 2016 at 5:14 PM Tim Shen <timshen at google.com> wrote:
> In 5), by saying "In the same block" I mean we should replace
uses of
> vreg0 with vreg2 afterwards, but only limited in the same basicblock.
>
> In 6), we need another tiny piece to actually lower MutatingFMA to real
> forms, which should be trivial.
>
> On Fri, Mar 4, 2016 at 5:09 PM Tim Shen <timshen at google.com>
wrote:
>
>> I wonder if we can do this in a separate analysis MachineFunction SSA
>> pass.
>> 1) SelectionDAG will generate a pseudo instruction MutatingFMA. When
it's
>> generated it's allowed to have d = a * b + c form, where d
doesn't have to
>> be in {a, b, c}.
>> 2) Later, the proposed pass uses an algorithm to decide for instruction
>> MI: `%vreg0 = MutatingFMA %vreg1, %vreg2, %vreg3`, it should tie %vreg0
>> with, say %vreg2. Then it tells the scheduler (through ScheduleDAG?)
"If
>> you put instruction MI as the last use of %vreg2, the killed %vreg2
will
>> save us a copy". The algorithm saves as many copies as possible,
and those
>> suggestions are guarantee to be compatible with each other, so that
>> scheduler can choose to follow all of them.
>> 3) In Two-Address instruction pass, all MutatingFMA instructions are
>> still handled and copies get inserted. That's fine. No change is
required.
>> 4) Scheduler optionally follows our suggestions.
>> 5) After Scheduling, we do a copy cleanup pass that change:
>>     %vreg0<def> = COPY %vreg2<kill>
>>     %vreg0<def, tied2> = MutatingFMA %vreg1, %vreg0<tied0>,
%vreg3
>>     To:
>>     %vreg2<def, tie2> = MutatingFMA %vreg1, %vreg2<tied0>,
%vreg3
>>     In the same block.
>> 6) 1)~5) are target independent. Now the tied MutatingFMA instruction
>> gets lowered to one of its real forms.
>>
>> The only thing left is to formalize the problem and come up with an
>> algorithm in 2):
>> Given a DAG of uninteresting nodes and interesting (e.g. MutatingFMA)
>> nodes, for each interesting nodes, select zero or one of its operand
uses
>> (corresponding to an edge) to mark. Mark as many edges as possible.
>>
>> I can't come up with an optimal algorithm for 2). Here's my
current
>> greedy solution:
>> 1) For each vreg, calculate how many MutatingFMA instructions are using
>> them, and put the result into ref_count. Set up a worklist that's
holding
>> all MutatingFMA instructions.
>> 2) Try to get a MutatingFMA from worklist, as MI, such that, it has at
>> least one operand %x that has ref_count 1. If we can't find one, go
to 4).
>> 3) Make a suggestion about MI and %x. Remove MI from active set and
>> decrease the ref_count for each of its uses by 1.
>> 4) If active set is empty, then we are done; otherwise remove a random
>> (in a heuristic manner) MutatingFMA from worklist (by making no
>> suggestions, aka keeping the copy) and go back to 2).
>>
>> For small amount (e.g. 5) of instructions, we can also enumerate
>> copy/no-copy decisions for each of them. That will take 2^5 = 32
different
>> cases, which isn't that many.
>>
>> Another concern is that ABI may require values to be put to designated
>> physical registers (e.g. put return value to F1). It would be good to
plumb
>> a path from F1 as an input, finally to F1 as an output, for example:
>> f = a * b + c * d + e
>>
>> it's better to do:
>> e = FMA c d e
>> a = FMA a b e
>> where the first will be an add form and the second will be a multiply
>> form. The greedy algorithm isn't aware of this, but the exponential
>> algorithm can pick this up.
>>
>> The algorithm is unaware of any FMA related knowledge though, so it can
>> be used for any mixed instructions with in-out registers.
>>
>> On Wed, Mar 2, 2016 at 3:40 PM Eric Christopher <echristo at
gmail.com>
>> wrote:
>>
>>> Relaying an IRC conversation here:
>>>
>>> Currently we're going to disable the VSXFMAMutate pass in ToT
and talk
>>> about ways to either rewrite or fix it to deal with more
complicated cfgs.
>>>
>>> Thanks!
>>>
>>> -eric
>>>
>>> On Mon, Feb 29, 2016 at 2:21 PM Tim Shen <timshen at
google.com> wrote:
>>>
>>>> Ping?
>>>>
>>>> On Mon, Feb 22, 2016 at 1:06 PM Tim Shen <timshen at
google.com> wrote:
>>>>
>>>>> On Fri, Feb 19, 2016 at 5:10 PM Tim Shen <timshen at
google.com> wrote:
>>>>>
>>>>>> I wonder if we can fix this by making the
transformation simpler,
>>>>>> that is, instead of doing:
>>>>>>
>>>>>
>>>>> I wrote a prototype (see attach) for this idea, it actually
improves
>>>>> some of the test cases (e.g. fma-assoc.ll:
test_FMADD_ASSOC1), but
>>>>> pessimize several other cases (e.g. test_FMADD_ASSOC_EXT1).
>>>>>
>>>>> I'm not sure what to do at this point, I have several
options:
>>>>> 1) This pass simply omits the optimization for certain
cases
>>>>> ("certain" needs to be defined).
>>>>> 2) This pass should never mutate anything out of the block,
and may
>>>>> add a copy at the end of the block, if the value is
live-out.
>>>>> 3) tune the transformation as I attempted, to make it
simpler,
>>>>> working, and producing better result? It seems interesting
to get the
>>>>> improvement as my prototype shown without hurting other
test cases, but it
>>>>> needs a bit thinking and I'd like to fix the compiler
crash soon.
>>>>>
>>>>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160316/7b705fcd/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: a.diff
Type: application/octet-stream
Size: 17974 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160316/7b705fcd/attachment-0001.obj>

Eric Christopher via llvm-dev

2016-Mar-23 00:13 UTC

head link

[llvm-dev] [VSXFMAMutate] OldFMAReg may be wrongly rewritten

I think we can probably go ahead and throw this up on Phabricator for
review. I'd probably bring in Matthias for review as well.

Thanks!

-eric

On Wed, Mar 16, 2016 at 10:53 AM Tim Shen <timshen at google.com> wrote:
> I implemented a proof of concept of a new generic MachineFunction SSA
> pass. The code is not readable and not efficient yet, but it shows
> interesting results:
>
> In fma.ll @test_FMSUB2 (return dummy(A * B + C, A * B - D)):
> before:
>         fmr 0, 1
>         xsmaddadp 3, 0, 2
>         xsmsubmdp 0, 2, 4
>         fmr 1, 3
>         fmr 2, 0
>         bl dummy2
>
> after:
>         xsmsubadp 4, 1, 2
>         xsmaddmdp 1, 2, 3
>         fmr 2, 4
>         bl dummy2
>
> In fma-assoc.ll all 9 copies are eliminated, for example (A * B + C * D +
> E):
> before:
>         xsmaddmdp 3, 4, 5
>         xsmaddadp 3, 1, 2
>         fmr 1, 3
>
> after:
>         xsmaddadp 5, 3, 4
>         xsmaddmdp 1, 2, 5
>
> In fma-ext.ll, the only copy is also eliminated.
>
> The new pass in theory can support any kinds of instructions that may pick
> one among multiple its operands as an in-out register. VSX FMA instructions
> are just some examples.
>
> The drawback is this pass requires LiveIntervals analysis at SSA form and
> Machine MachineDominatorTree analysis. It definitely preserves the former,
> but not sure about the latter - it rearranges operands and re-tie them.
>
> If you think this result is good enough, I'll go ahead and do a bit
> cleanup and code optimization. It's in theory as fast as a topological
sort.
>
> On Fri, Mar 4, 2016 at 5:14 PM Tim Shen <timshen at google.com>
wrote:
>
>> In 5), by saying "In the same block" I mean we should replace
uses of
>> vreg0 with vreg2 afterwards, but only limited in the same basicblock.
>>
>> In 6), we need another tiny piece to actually lower MutatingFMA to real
>> forms, which should be trivial.
>>
>> On Fri, Mar 4, 2016 at 5:09 PM Tim Shen <timshen at google.com>
wrote:
>>
>>> I wonder if we can do this in a separate analysis MachineFunction
SSA
>>> pass.
>>> 1) SelectionDAG will generate a pseudo instruction MutatingFMA.
When
>>> it's generated it's allowed to have d = a * b + c form,
where d doesn't
>>> have to be in {a, b, c}.
>>> 2) Later, the proposed pass uses an algorithm to decide for
instruction
>>> MI: `%vreg0 = MutatingFMA %vreg1, %vreg2, %vreg3`, it should tie
%vreg0
>>> with, say %vreg2. Then it tells the scheduler (through
ScheduleDAG?) "If
>>> you put instruction MI as the last use of %vreg2, the killed %vreg2
will
>>> save us a copy". The algorithm saves as many copies as
possible, and those
>>> suggestions are guarantee to be compatible with each other, so that
>>> scheduler can choose to follow all of them.
>>> 3) In Two-Address instruction pass, all MutatingFMA instructions
are
>>> still handled and copies get inserted. That's fine. No change
is required.
>>> 4) Scheduler optionally follows our suggestions.
>>> 5) After Scheduling, we do a copy cleanup pass that change:
>>>     %vreg0<def> = COPY %vreg2<kill>
>>>     %vreg0<def, tied2> = MutatingFMA %vreg1,
%vreg0<tied0>, %vreg3
>>>     To:
>>>     %vreg2<def, tie2> = MutatingFMA %vreg1,
%vreg2<tied0>, %vreg3
>>>     In the same block.
>>> 6) 1)~5) are target independent. Now the tied MutatingFMA
instruction
>>> gets lowered to one of its real forms.
>>>
>>> The only thing left is to formalize the problem and come up with an
>>> algorithm in 2):
>>> Given a DAG of uninteresting nodes and interesting (e.g.
MutatingFMA)
>>> nodes, for each interesting nodes, select zero or one of its
operand uses
>>> (corresponding to an edge) to mark. Mark as many edges as possible.
>>>
>>> I can't come up with an optimal algorithm for 2). Here's my
current
>>> greedy solution:
>>> 1) For each vreg, calculate how many MutatingFMA instructions are
using
>>> them, and put the result into ref_count. Set up a worklist
that's holding
>>> all MutatingFMA instructions.
>>> 2) Try to get a MutatingFMA from worklist, as MI, such that, it has
at
>>> least one operand %x that has ref_count 1. If we can't find
one, go to 4).
>>> 3) Make a suggestion about MI and %x. Remove MI from active set and
>>> decrease the ref_count for each of its uses by 1.
>>> 4) If active set is empty, then we are done; otherwise remove a
random
>>> (in a heuristic manner) MutatingFMA from worklist (by making no
>>> suggestions, aka keeping the copy) and go back to 2).
>>>
>>> For small amount (e.g. 5) of instructions, we can also enumerate
>>> copy/no-copy decisions for each of them. That will take 2^5 = 32
different
>>> cases, which isn't that many.
>>>
>>> Another concern is that ABI may require values to be put to
designated
>>> physical registers (e.g. put return value to F1). It would be good
to plumb
>>> a path from F1 as an input, finally to F1 as an output, for
example:
>>> f = a * b + c * d + e
>>>
>>> it's better to do:
>>> e = FMA c d e
>>> a = FMA a b e
>>> where the first will be an add form and the second will be a
multiply
>>> form. The greedy algorithm isn't aware of this, but the
exponential
>>> algorithm can pick this up.
>>>
>>> The algorithm is unaware of any FMA related knowledge though, so it
can
>>> be used for any mixed instructions with in-out registers.
>>>
>>> On Wed, Mar 2, 2016 at 3:40 PM Eric Christopher <echristo at
gmail.com>
>>> wrote:
>>>
>>>> Relaying an IRC conversation here:
>>>>
>>>> Currently we're going to disable the VSXFMAMutate pass in
ToT and talk
>>>> about ways to either rewrite or fix it to deal with more
complicated cfgs.
>>>>
>>>> Thanks!
>>>>
>>>> -eric
>>>>
>>>> On Mon, Feb 29, 2016 at 2:21 PM Tim Shen <timshen at
google.com> wrote:
>>>>
>>>>> Ping?
>>>>>
>>>>> On Mon, Feb 22, 2016 at 1:06 PM Tim Shen <timshen at
google.com> wrote:
>>>>>
>>>>>> On Fri, Feb 19, 2016 at 5:10 PM Tim Shen <timshen at
google.com> wrote:
>>>>>>
>>>>>>> I wonder if we can fix this by making the
transformation simpler,
>>>>>>> that is, instead of doing:
>>>>>>>
>>>>>>
>>>>>> I wrote a prototype (see attach) for this idea, it
actually improves
>>>>>> some of the test cases (e.g. fma-assoc.ll:
test_FMADD_ASSOC1), but
>>>>>> pessimize several other cases (e.g.
test_FMADD_ASSOC_EXT1).
>>>>>>
>>>>>> I'm not sure what to do at this point, I have
several options:
>>>>>> 1) This pass simply omits the optimization for certain
cases
>>>>>> ("certain" needs to be defined).
>>>>>> 2) This pass should never mutate anything out of the
block, and may
>>>>>> add a copy at the end of the block, if the value is
live-out.
>>>>>> 3) tune the transformation as I attempted, to make it
simpler,
>>>>>> working, and producing better result? It seems
interesting to get the
>>>>>> improvement as my prototype shown without hurting other
test cases, but it
>>>>>> needs a bit thinking and I'd like to fix the
compiler crash soon.
>>>>>>
>>>>>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160323/136da5b1/attachment-0001.html>

Tim Shen via llvm-dev

2016-Mar-25 01:16 UTC

head link

[llvm-dev] [VSXFMAMutate] OldFMAReg may be wrongly rewritten

On Tue, Mar 22, 2016 at 5:13 PM Eric Christopher wrote:
> I think we can probably go ahead and throw this up on Phabricator for
> review. I'd probably bring in Matthias for review as well.
>
> Thanks!
>
> -eric
>
(Follow up on the discuss in IRC)

I tried to bridge PPC backend and commuteInstruction, not sure if
correctly, but here I got some non-optimal results: in 12 differences, 8 of
them are improvements, but 3 of them are regressions.

If I implemented commuteInstruction correctly, and current two-address
isn't picking up all of them, I guess my pass is still worth trying (it
catches all these cases), as long as it's ultimately "melt into"
two-address ?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160325/dcc19e11/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bridge-no-commit.diff
Type: application/octet-stream
Size: 10238 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160325/dcc19e11/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: result.diff
Type: application/octet-stream
Size: 2298 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160325/dcc19e11/attachment-0001.obj>

Seemingly Similar Threads

Search for more reasonably related threads

llvm dev - Mar 2016 - [VSXFMAMutate] OldFMAReg may be wrongly rewritten

[llvm-dev] [VSXFMAMutate] OldFMAReg may be wrongly rewritten

[llvm-dev] [VSXFMAMutate] OldFMAReg may be wrongly rewritten

[llvm-dev] [VSXFMAMutate] OldFMAReg may be wrongly rewritten

Seemingly Similar Threads