thr3ads.net - llvm dev - [LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node [May 2014]

If this information is useful, please help other people find it:
Share via:

Jingyue Wu

2014-May-21 01:05 UTC

[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node

Hi,

I want to improve the way CGP sinks the incoming values of a PHI node
towards memory accesses. Improving it means a lot to some of our key
benchmarks, and I believe can benefit many general cases as well.

CGP's OptimizeMemoryInst function handles PHI nodes by running
AddressingModeMatcher on all incoming values to see whether they reach
consensus on the addressing mode. It does a reasonably good job in sinking
a PHI node if its incoming values are used only once (by the PHI node) and
share the same base register and offset. For example, CGP transforms

define void @foo(i1 %cond, float* %a) {
entry:
  br i1 %cond, label %if.then, label %if.else

if.then:
  %p1 = getelementptr inbounds float* %a, i64 4
  br label %merge

if.else:
  %q1 = getelementptr inbounds float* %a, i64 4
  br label %merge

merge:
  %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
  %w1 = load float* %r1
  ret void
}

to

define void @foo(i1 %cond, float* %a) {
entry:
  br i1 %cond, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  br label %merge

if.else:                                          ; preds = %entry
  br label %merge

merge:                                            ; preds = %if.else,
%if.then
  %0 = bitcast float* %a to i8*
  %sunkaddr = getelementptr i8* %0, i64 16
  %1 = bitcast i8* %sunkaddr to float*
  %w1 = load float* %1
  ret void
}

saving registers on %p1 and %q1.

However, this approach misses optimization opportunities in two cases.
First, if an incoming value of a PHI node is used in multiple memory
instructions, CGP may not be able to sink it.

For example, if we slightly complicate the above example by adding a load
after %p1,

define void @foo(i1 %cond, float* %a) {
entry:
  br i1 %cond, label %if.then, label %if.else

if.then:
  %p1 = getelementptr inbounds float* %a, i64 4
  %u1 = load float* %p1
  br label %merge

if.else:
  %q1 = getelementptr inbounds float* %a, i64 4
  br label %merge

merge:
  %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
  %w1 = load float* %r1
  ret void
}

CGP is unable to sink %p1 any more, because, given %p1 is used multiple
times, CGP checks whether %p1 can be folded into every memory uses (in this
case, %u1 = load %p1 and %w1 = load %r1) with the same addressing mode. It
does that by dry-running MatchOperationAddr on both loads, however
MatchOperationAddr bails out on PHI nodes.

The second complication is the incoming values of a PHI node may have
different base registers. For example, we slightly modify the original
example by changing %q1's base to %b.

define void @foo(i1 %cond, float* %a, float* %b) {
entry:
  br i1 %cond, label %if.then, label %if.else

if.then:
  %p1 = getelementptr inbounds float* %a, i64 4
  br label %merge

if.else:
  %q1 = getelementptr inbounds float* %b, i64 4
  br label %merge

merge:
  %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
  %w1 = load float* %r1
  ret void
}

Ideally we still want to fold %p1 and %q1 into the load by introducing a
new base phi'ing %a and %b, such as

%r = phi float* [ %a, %if.then ], [ %b, %if.else ]
%r1 = getelementptr inbounds float* %r, i64 4
%w1 = load float* %r1

saving registers on %p1 and %q1.

However, the consensus CGP requires all incoming values to reach includes
not only the offset but also the base register. (a + 4) and (b + 4)
unfortunately do not satisfy this condition. Therefore, CGP won't sink %p1
and %q1 toward BB %merge.

Our key benchmarks have both cases I described above. A simplified code
looks like:

define void @foo(i1 %cond, float* %a, float* %b) {
entry:
  br i1 %cond, label %if.then, label %if.else

if.then:
  %p1 = getelementptr inbounds float* %a, i64 4
  %u1 = load float* %p1
  br label %merge

if.else:
  %q1 = getelementptr inbounds float* %b, i64 4
  br label %merge

merge:
  %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
  %w1 = load float* %r1
  ret void
}

so we need a solution to attack both issues. While we can solve the second
issue by introducing a new base register leveraging SSAUpdater, I am not
sure how to fold this complicated logic in MatchOperationAddr which updates
one single target addressing mode. One possibility is to extend ExtAddrMode
to support multiple base registers, which may complicate a lot of things.

While I am thinking about an appropriate solution, I would like to bring
these issues up to get more feedback. I really appreciate your thoughts on
this.

Thanks!
Jingyue
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140520/6197edaa/attachment.html>

Philip Reames

2014-May-21 18:04 UTC

head link

[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node

On 05/20/2014 06:05 PM, Jingyue Wu wrote:> Hi,
>
> I want to improve the way CGP sinks the incoming values of a PHI node 
> towards memory accesses. Improving it means a lot to some of our key 
> benchmarks, and I believe can benefit many general cases as well.I found your examples interesting so I'm responding, but keep in mind I 
know very little about the actual code in CGP.  Feel free to tell me I 
don't know what I'm talking about.  :)>
> CGP's OptimizeMemoryInst function handles PHI nodes by running 
> AddressingModeMatcher on all incoming values to see whether they reach 
> consensus on the addressing mode. It does a reasonably good job in 
> sinking a PHI node if its incoming values are used only once (by the 
> PHI node) and share the same base register and offset. For example, 
> CGP transforms
>
> define void @foo(i1 %cond, float* %a) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
> if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
>
> if.else:
>   %q1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
>
> merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
>
> to
>
> define void @foo(i1 %cond, float* %a) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
> if.then:                                          ; preds = %entry
>   br label %merge
>
> if.else:                                          ; preds = %entry
>   br label %merge
>
> merge:                                            ; preds = %if.else, 
> %if.then
>   %0 = bitcast float* %a to i8*
>   %sunkaddr = getelementptr i8* %0, i64 16
>   %1 = bitcast i8* %sunkaddr to float*
>   %w1 = load float* %1
>   ret void
> }
>
> saving registers on %p1 and %q1.I may be missing something here, but other than the type casts, why is 
this optimization specific to CGP?  It seems like you've essentially 
just folded away a redundant phi.  Shouldn't this be much earlier in the 
optimizer pipeline?>
> However, this approach misses optimization opportunities in two cases. 
> First, if an incoming value of a PHI node is used in multiple memory 
> instructions, CGP may not be able to sink it.
>
> For example, if we slightly complicate the above example by adding a 
> load after %p1,
>
> define void @foo(i1 %cond, float* %a) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
> if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   %u1 = load float* %p1
>   br label %merge
>
> if.else:
>   %q1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
>
> merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
>
> CGP is unable to sink %p1 any more, because, given %p1 is used 
> multiple times, CGP checks whether %p1 can be folded into every memory 
> uses (in this case, %u1 = load %p1 and %w1 = load %r1) with the same 
> addressing mode. It does that by dry-running MatchOperationAddr on 
> both loads, however MatchOperationAddr bails out on PHI nodes.To phrase this differently, we currently decide it's not *profitable* if 
there are extra loads right?  I want to make sure there isn't a 
correctness issue I'm missing.

If so, I'm not quite sure why we're requiring the address modes of two 
unrelated loads to match here. We should be able to turn this into:

define void @foo(i1 %cond, float* %a) {
entry:
   br i1 %cond, label %if.then, label %if.else

if.then:
   %p1 = getelementptr inbounds float* %a, i64 4 <-- folded into load u1
   %u1 = load float* %p1
   br label %merge

if.else:
   br label %merge

merge:
   %r1 = getelementptr inbounds float* %a, i64 4 <-- folded into load u2
   %w1 = load float* %r1
   ret void
}

Unless I'm missing something - quite possible! - this would lead to 
better code.  The profitability constraint should be "if each load can 
fold", not "if each load can fold the same way".  Or am I wrong
about this?

Fundamentally, this still seems like a missed optimization in removing 
the phi, not an issue with CGP.
>
> The second complication is the incoming values of a PHI node may have 
> different base registers. For example, we slightly modify the original 
> example by changing %q1's base to %b.
>
> define void @foo(i1 %cond, float* %a, float* %b) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
> if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
>
> if.else:
>   %q1 = getelementptr inbounds float* %b, i64 4
>   br label %merge
>
> merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
>
> Ideally we still want to fold %p1 and %q1 into the load by introducing 
> a new base phi'ing %a and %b, such as
>
> %r = phi float* [ %a, %if.then ], [ %b, %if.else ]
> %r1 = getelementptr inbounds float* %r, i64 4
> %w1 = load float* %r1
>
> saving registers on %p1 and %q1.
>
> However, the consensus CGP requires all incoming values to reach 
> includes not only the offset but also the base register. (a + 4) and 
> (b + 4) unfortunately do not satisfy this condition. Therefore, CGP 
> won't sink %p1 and %q1 toward BB %merge.Extending this code to handle separate base values seems quite 
reasonable.  You could generalize this a lot, but starting with 
different base values (i.e. insert a phi, use that) seems like a good 
starting place.

Again, this really sounds like something which should be earlier in the 
optimization pipeline though.  The heart of this is simply removing an 
unnecessary phi.>
> Our key benchmarks have both cases I described above. A simplified 
> code looks like:
>
> define void @foo(i1 %cond, float* %a, float* %b) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
> if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   %u1 = load float* %p1
>   br label %merge
>
> if.else:
>   %q1 = getelementptr inbounds float* %b, i64 4
>   br label %merge
>
> merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
>
> so we need a solution to attack both issues. While we can solve the 
> second issue by introducing a new base register leveraging SSAUpdater, 
> I am not sure how to fold this complicated logic in MatchOperationAddr 
> which updates one single target addressing mode. One possibility is to 
> extend ExtAddrMode to support multiple base registers, which may 
> complicate a lot of things.
>
> While I am thinking about an appropriate solution, I would like to 
> bring these issues up to get more feedback. I really appreciate your 
> thoughts on this.
>
> Thanks!
> Jingyue
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140521/e571f9de/attachment.html>

Louis Gerbarg

2014-May-21 19:40 UTC

head link

[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node

It appears to me that most of the optimization cases you are attempting to do
here are handled by an InstCombine patch I wrote last week to handle address
folding across GEPs (r209049 and r209065). Unfortunately there were some
complications so we reverted it. I believe I have a fix for the issues that were
causing failures on the buildbots, but I have not sent it yet because it also
prevents the specific optimization case I had.

I am particularly interested in knowing what benchmarks this helps you with. I
have some more thoughts on how to deal with some related optimization
opportunities, I’ll try to write them up for comments today or tomorrow.

Attached is a combined patch that includes the original commit, the first set of
fixes from friday night, and a final fix that allowed me to pass the reduction
cases sent to me.

Louis

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Add-support-for-combining-GEPs-across-PHI-nodes.patch
Type: application/octet-stream
Size: 7702 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140521/852d8a59/attachment.obj>
-------------- next part --------------

> On May 20, 2014, at 6:05 PM, Jingyue Wu <jingyue at google.com>
wrote:
> 
> Hi, 
> 
> I want to improve the way CGP sinks the incoming values of a PHI node
towards memory accesses. Improving it means a lot to some of our key benchmarks,
and I believe can benefit many general cases as well.
> 
> CGP's OptimizeMemoryInst function handles PHI nodes by running
AddressingModeMatcher on all incoming values to see whether they reach consensus
on the addressing mode. It does a reasonably good job in sinking a PHI node if
its incoming values are used only once (by the PHI node) and share the same base
register and offset. For example, CGP transforms
> 
> define void @foo(i1 %cond, float* %a) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
> 
> if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
> 
> if.else:
>   %q1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
> 
> merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
> 
> to
> 
> define void @foo(i1 %cond, float* %a) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
> 
> if.then:                                          ; preds = %entry
>   br label %merge
> 
> if.else:                                          ; preds = %entry
>   br label %merge
> 
> merge:                                            ; preds = %if.else,
%if.then
>   %0 = bitcast float* %a to i8*
>   %sunkaddr = getelementptr i8* %0, i64 16
>   %1 = bitcast i8* %sunkaddr to float*
>   %w1 = load float* %1
>   ret void
> }
> 
> saving registers on %p1 and %q1. 
> 
> However, this approach misses optimization opportunities in two cases.
First, if an incoming value of a PHI node is used in multiple memory
instructions, CGP may not be able to sink it.
> 
> For example, if we slightly complicate the above example by adding a load
after %p1,
> 
> define void @foo(i1 %cond, float* %a) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
> 
> if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   %u1 = load float* %p1
>   br label %merge
> 
> if.else:
>   %q1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
> 
> merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
> 
> CGP is unable to sink %p1 any more, because, given %p1 is used multiple
times, CGP checks whether %p1 can be folded into every memory uses (in this
case, %u1 = load %p1 and %w1 = load %r1) with the same addressing mode. It does
that by dry-running MatchOperationAddr on both loads, however MatchOperationAddr
bails out on PHI nodes.
> 
> The second complication is the incoming values of a PHI node may have
different base registers. For example, we slightly modify the original example
by changing %q1's base to %b.
> 
> define void @foo(i1 %cond, float* %a, float* %b) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
> 
> if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
> 
> if.else:
>   %q1 = getelementptr inbounds float* %b, i64 4
>   br label %merge
> 
> merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
> 
> Ideally we still want to fold %p1 and %q1 into the load by introducing a
new base phi'ing %a and %b, such as
> 
> %r = phi float* [ %a, %if.then ], [ %b, %if.else ]
> %r1 = getelementptr inbounds float* %r, i64 4
> %w1 = load float* %r1
> 
> saving registers on %p1 and %q1. 
> 
> However, the consensus CGP requires all incoming values to reach includes
not only the offset but also the base register. (a + 4) and (b + 4)
unfortunately do not satisfy this condition. Therefore, CGP won't sink %p1
and %q1 toward BB %merge.
> 
> Our key benchmarks have both cases I described above. A simplified code
looks like:
> 
> define void @foo(i1 %cond, float* %a, float* %b) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
> 
> if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   %u1 = load float* %p1
>   br label %merge
> 
> if.else:
>   %q1 = getelementptr inbounds float* %b, i64 4
>   br label %merge
> 
> merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
> 
> so we need a solution to attack both issues. While we can solve the second
issue by introducing a new base register leveraging SSAUpdater, I am not sure
how to fold this complicated logic in MatchOperationAddr which updates one
single target addressing mode. One possibility is to extend ExtAddrMode to
support multiple base registers, which may complicate a lot of things.
> 
> While I am thinking about an appropriate solution, I would like to bring
these issues up to get more feedback. I really appreciate your thoughts on this.
> 
> Thanks! 
> Jingyue
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Philip Reames

2014-May-21 21:17 UTC

head link

[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node

Louis,

I would say the optimizations discussed here are very similar, but not 
exactly the same as the ones in your change.  Your change could 
definitely be generalized to handle these cases though.

Your change only considers phis which feed GEPs which have GEP inputs.  
To handle Jingyue's cases, we'd need to extend this to (at least) phis 
which feed loads which have GEP inputs.

To prevent code size increase, we'd probably need to limit that use case 
to cases where we could remove at least one of the original input phis.  
In your original use case, you avoided growth by relying on the GEP 
folding of the new GEP and the original GEP. This doesn't apply for the 
extended case.

I would suggest that you get your current optimization in, then we come 
back and generalize it.

Philip

On 05/21/2014 12:40 PM, Louis Gerbarg wrote:> It appears to me that most of the optimization cases you are attempting to
do here are handled by an InstCombine patch I wrote last week to handle address
folding across GEPs (r209049 and r209065). Unfortunately there were some
complications so we reverted it. I believe I have a fix for the issues that were
causing failures on the buildbots, but I have not sent it yet because it also
prevents the specific optimization case I had.
>
> I am particularly interested in knowing what benchmarks this helps you
with. I have some more thoughts on how to deal with some related optimization
opportunities, I'll try to write them up for comments today or tomorrow.
>
> Attached is a combined patch that includes the original commit, the first
set of fixes from friday night, and a final fix that allowed me to pass the
reduction cases sent to me.
>
> Louis
>
>
>
>
>> On May 20, 2014, at 6:05 PM, Jingyue Wu <jingyue at google.com>
wrote:
>>
>> Hi,
>>
>> I want to improve the way CGP sinks the incoming values of a PHI node
towards memory accesses. Improving it means a lot to some of our key benchmarks,
and I believe can benefit many general cases as well.
>>
>> CGP's OptimizeMemoryInst function handles PHI nodes by running
AddressingModeMatcher on all incoming values to see whether they reach consensus
on the addressing mode. It does a reasonably good job in sinking a PHI node if
its incoming values are used only once (by the PHI node) and share the same base
register and offset. For example, CGP transforms
>>
>> define void @foo(i1 %cond, float* %a) {
>> entry:
>>    br i1 %cond, label %if.then, label %if.else
>>
>> if.then:
>>    %p1 = getelementptr inbounds float* %a, i64 4
>>    br label %merge
>>
>> if.else:
>>    %q1 = getelementptr inbounds float* %a, i64 4
>>    br label %merge
>>
>> merge:
>>    %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>>    %w1 = load float* %r1
>>    ret void
>> }
>>
>> to
>>
>> define void @foo(i1 %cond, float* %a) {
>> entry:
>>    br i1 %cond, label %if.then, label %if.else
>>
>> if.then:                                          ; preds = %entry
>>    br label %merge
>>
>> if.else:                                          ; preds = %entry
>>    br label %merge
>>
>> merge:                                            ; preds = %if.else,
%if.then
>>    %0 = bitcast float* %a to i8*
>>    %sunkaddr = getelementptr i8* %0, i64 16
>>    %1 = bitcast i8* %sunkaddr to float*
>>    %w1 = load float* %1
>>    ret void
>> }
>>
>> saving registers on %p1 and %q1.
>>
>> However, this approach misses optimization opportunities in two cases.
First, if an incoming value of a PHI node is used in multiple memory
instructions, CGP may not be able to sink it.
>>
>> For example, if we slightly complicate the above example by adding a
load after %p1,
>>
>> define void @foo(i1 %cond, float* %a) {
>> entry:
>>    br i1 %cond, label %if.then, label %if.else
>>
>> if.then:
>>    %p1 = getelementptr inbounds float* %a, i64 4
>>    %u1 = load float* %p1
>>    br label %merge
>>
>> if.else:
>>    %q1 = getelementptr inbounds float* %a, i64 4
>>    br label %merge
>>
>> merge:
>>    %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>>    %w1 = load float* %r1
>>    ret void
>> }
>>
>> CGP is unable to sink %p1 any more, because, given %p1 is used multiple
times, CGP checks whether %p1 can be folded into every memory uses (in this
case, %u1 = load %p1 and %w1 = load %r1) with the same addressing mode. It does
that by dry-running MatchOperationAddr on both loads, however MatchOperationAddr
bails out on PHI nodes.
>>
>> The second complication is the incoming values of a PHI node may have
different base registers. For example, we slightly modify the original example
by changing %q1's base to %b.
>>
>> define void @foo(i1 %cond, float* %a, float* %b) {
>> entry:
>>    br i1 %cond, label %if.then, label %if.else
>>
>> if.then:
>>    %p1 = getelementptr inbounds float* %a, i64 4
>>    br label %merge
>>
>> if.else:
>>    %q1 = getelementptr inbounds float* %b, i64 4
>>    br label %merge
>>
>> merge:
>>    %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>>    %w1 = load float* %r1
>>    ret void
>> }
>>
>> Ideally we still want to fold %p1 and %q1 into the load by introducing
a new base phi'ing %a and %b, such as
>>
>> %r = phi float* [ %a, %if.then ], [ %b, %if.else ]
>> %r1 = getelementptr inbounds float* %r, i64 4
>> %w1 = load float* %r1
>>
>> saving registers on %p1 and %q1.
>>
>> However, the consensus CGP requires all incoming values to reach
includes not only the offset but also the base register. (a + 4) and (b + 4)
unfortunately do not satisfy this condition. Therefore, CGP won't sink %p1
and %q1 toward BB %merge.
>>
>> Our key benchmarks have both cases I described above. A simplified code
looks like:
>>
>> define void @foo(i1 %cond, float* %a, float* %b) {
>> entry:
>>    br i1 %cond, label %if.then, label %if.else
>>
>> if.then:
>>    %p1 = getelementptr inbounds float* %a, i64 4
>>    %u1 = load float* %p1
>>    br label %merge
>>
>> if.else:
>>    %q1 = getelementptr inbounds float* %b, i64 4
>>    br label %merge
>>
>> merge:
>>    %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>>    %w1 = load float* %r1
>>    ret void
>> }
>>
>> so we need a solution to attack both issues. While we can solve the
second issue by introducing a new base register leveraging SSAUpdater, I am not
sure how to fold this complicated logic in MatchOperationAddr which updates one
single target addressing mode. One possibility is to extend ExtAddrMode to
support multiple base registers, which may complicate a lot of things.
>>
>> While I am thinking about an appropriate solution, I would like to
bring these issues up to get more feedback. I really appreciate your thoughts on
this.
>>
>> Thanks!
>> Jingyue
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140521/5d5f5f95/attachment.html>

Jingyue Wu

2014-May-21 21:18 UTC

head link

[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node

Hi Philip,

Thanks for your feedback! Quite thoughtful.

1. Is it CGP-specific or more generic?

I still think it is CGP-specific. The first two examples are probably too
simple, but the third example is more interesting:

(for simplicity, rewriting the example in pseudo-code)

if (cond) {
  p = a + 4
  *p
} else {
  q = b + 4
}
r = phi(p, q)
*r

With my imaginary optimization, the code becomes:

if (cond) {
  p = a + 4
  *p
} else {
  q = b + 4
}
r2 = phi(a, b)
r = r2 + 4
*r

When cond == true, the program will run two additions (p = a + 4 and r = r2
+ 4) instead of one in the original code. Therefore, this transformation
may not be a right thing to do in previous passes.

2. How does CGP decide profitability?

If an address is used (either memory use or non-memory use) multiple times,
folding it to load/store may increase the register pressure. So, to answer
your question, it's a performance consideration instead of a correctness
issue. The comments before IsProfitableToFoldIntoAddressingMode give a very
nice example. Note that, as they said, this profitability computation is
just a rough heuristic and may be too conservative.

/// IsProfitableToFoldIntoAddressingMode - It is possible for the
addressing
/// mode of the machine to fold the specified instruction into a load or
store
/// that ultimately uses it.  However, the specified instruction has
multiple
/// uses.  Given this, it may actually increase register pressure to fold
it
/// into the load.  For example, consider this code:

///

///     X = ...

///     Y = X+1

///     use(Y)   -> nonload/store

///     Z = Y+1

///     load Z

///

/// In this case, Y has multiple uses, and can be folded into the load of Z

/// (yielding load [X+2]).  However, doing this will cause both "X"
and
"X+1" to
/// be live at the use(Y) line.  If we don't fold Y into load Z, we use one

/// fewer register.  Since Y can't be folded into "use(Y)" we
don't
increase the
/// number of computations either.

///

/// Note that this (like most of CodeGenPrepare) is just a rough heuristic.
 If
/// X was live across 'load Z' for other reasons, we actually *would*
want
to
/// fold the addressing mode in the Z case.  This would make Y die earlier.

One exception where they do sink an address with multiple uses is that all
uses of this address are ultimately load/store/inlineasm's. This is
actually the case for our example: both *p and *r are memory uses.
Therefore, the issue here is relatively minor: MatchOperationAddr does not
track PHI nodes, and thus doesn't realize "p = a + 4" can be
folded into
*phi(p, q).

Also, you are right in saying the profitability constraint should be "if
each load can fold", not "if each load can fold the same way". I
missed
some logics there.

Thanks,
Jingyue


On Wed, May 21, 2014 at 11:04 AM, Philip Reames
<listmail at philipreames.com>wrote:
>
> On 05/20/2014 06:05 PM, Jingyue Wu wrote:
>
> Hi,
>
>  I want to improve the way CGP sinks the incoming values of a PHI node
> towards memory accesses. Improving it means a lot to some of our key
> benchmarks, and I believe can benefit many general cases as well.
>
> I found your examples interesting so I'm responding, but keep in mind I
> know very little about the actual code in CGP.  Feel free to tell me I
> don't know what I'm talking about.  :)
>
>
>  CGP's OptimizeMemoryInst function handles PHI nodes by running
> AddressingModeMatcher on all incoming values to see whether they reach
> consensus on the addressing mode. It does a reasonably good job in sinking
> a PHI node if its incoming values are used only once (by the PHI node) and
> share the same base register and offset. For example, CGP transforms
>
>  define void @foo(i1 %cond, float* %a) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
>  if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
>
>  if.else:
>   %q1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
>
>  merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
>
>  to
>
>  define void @foo(i1 %cond, float* %a) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
>  if.then:                                          ; preds = %entry
>   br label %merge
>
>  if.else:                                          ; preds = %entry
>   br label %merge
>
>  merge:                                            ; preds = %if.else,
> %if.then
>   %0 = bitcast float* %a to i8*
>   %sunkaddr = getelementptr i8* %0, i64 16
>   %1 = bitcast i8* %sunkaddr to float*
>   %w1 = load float* %1
>   ret void
> }
>
>  saving registers on %p1 and %q1.
>
> I may be missing something here, but other than the type casts, why is
> this optimization specific to CGP?  It seems like you've essentially
just
> folded away a redundant phi.  Shouldn't this be much earlier in the
> optimizer pipeline?
>
>
>  However, this approach misses optimization opportunities in two cases.
> First, if an incoming value of a PHI node is used in multiple memory
> instructions, CGP may not be able to sink it.
>
>  For example, if we slightly complicate the above example by adding a
> load after %p1,
>
>   define void @foo(i1 %cond, float* %a) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
>  if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   %u1 = load float* %p1
>   br label %merge
>
>  if.else:
>   %q1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
>
>  merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
>
>  CGP is unable to sink %p1 any more, because, given %p1 is used multiple
> times, CGP checks whether %p1 can be folded into every memory uses (in this
> case, %u1 = load %p1 and %w1 = load %r1) with the same addressing mode. It
> does that by dry-running MatchOperationAddr on both loads, however
> MatchOperationAddr bails out on PHI nodes.
>
> To phrase this differently, we currently decide it's not *profitable*
if
> there are extra loads right?  I want to make sure there isn't a
correctness
> issue I'm missing.
>
> If so, I'm not quite sure why we're requiring the address modes of
two
> unrelated loads to match here. We should be able to turn this into:
>
>  define void @foo(i1 %cond, float* %a) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
>  if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4 <-- folded into load u1
>    %u1 = load float* %p1
>   br label %merge
>
>  if.else:
>   br label %merge
>
>  merge:
>   %r1 = getelementptr inbounds float* %a, i64 4 <-- folded into load u2
>    %w1 = load float* %r1
>   ret void
> }
>
> Unless I'm missing something - quite possible! - this would lead to
better
> code.  The profitability constraint should be "if each load can
fold", not
> "if each load can fold the same way".  Or am I wrong about this?
>
> Fundamentally, this still seems like a missed optimization in removing the
> phi, not an issue with CGP.
>
>
>  The second complication is the incoming values of a PHI node may have
> different base registers. For example, we slightly modify the original
> example by changing %q1's base to %b.
>
>   define void @foo(i1 %cond, float* %a, float* %b) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
>  if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
>
>  if.else:
>   %q1 = getelementptr inbounds float* %b, i64 4
>   br label %merge
>
>  merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
>
>  Ideally we still want to fold %p1 and %q1 into the load by introducing a
> new base phi'ing %a and %b, such as
>
>  %r = phi float* [ %a, %if.then ], [ %b, %if.else ]
> %r1 = getelementptr inbounds float* %r, i64 4
> %w1 = load float* %r1
>
>  saving registers on %p1 and %q1.
>
>  However, the consensus CGP requires all incoming values to reach
> includes not only the offset but also the base register. (a + 4) and (b +
> 4) unfortunately do not satisfy this condition. Therefore, CGP won't
sink
> %p1 and %q1 toward BB %merge.
>
> Extending this code to handle separate base values seems quite
> reasonable.  You could generalize this a lot, but starting with different
> base values (i.e. insert a phi, use that) seems like a good starting
> place.
>
> Again, this really sounds like something which should be earlier in the
> optimization pipeline though.  The heart of this is simply removing an
> unnecessary phi.
>
>
>  Our key benchmarks have both cases I described above. A simplified code
> looks like:
>
>   define void @foo(i1 %cond, float* %a, float* %b) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
>  if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   %u1 = load float* %p1
>   br label %merge
>
>  if.else:
>   %q1 = getelementptr inbounds float* %b, i64 4
>   br label %merge
>
>  merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
>
>  so we need a solution to attack both issues. While we can solve the
> second issue by introducing a new base register leveraging SSAUpdater, I am
> not sure how to fold this complicated logic in MatchOperationAddr which
> updates one single target addressing mode. One possibility is to extend
> ExtAddrMode to support multiple base registers, which may complicate a lot
> of things.
>
>  While I am thinking about an appropriate solution, I would like to bring
> these issues up to get more feedback. I really appreciate your thoughts on
> this.
>
>  Thanks!
> Jingyue
>
>
> _______________________________________________
> LLVM Developers mailing listLLVMdev at cs.uiuc.edu        
http://llvm.cs.uiuc.eduhttp://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140521/e58a897b/attachment.html>

Jingyue Wu

2014-May-21 22:25 UTC

head link

[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node

Hi Louis,

Thanks for posting this! After reading your patch and double-checking my
benchmark, I think a similar logic as the one in your patch would
definitely help.

Some complications for my case:
1. As Philip mentioned, we want to do this for PHI nodes used in
load/store.
2. The incoming values of the PHI node may be bitcasted/addrspacecasted
GEPs and even uglygeps (ptrtoint, add, inttoptr). But I think these are
easy to fix.
3. We may have to run this transformation of combining GEPs in the NVPTX
backend specifically after SeparateConstOffsetFromGEP (a pass I recently
wrote). I am mostly dealing with the GEPs generated in this pass.

Your patch inspires me the possibility of writing another pass that makes
CGP's tasks easier. Based on the current implementation, I may end up with
copying some logics from your patch. Is it acceptable at all?

Jingyue


On Wed, May 21, 2014 at 12:40 PM, Louis Gerbarg <lgg at apple.com> wrote:
> It appears to me that most of the optimization cases you are attempting to
> do here are handled by an InstCombine patch I wrote last week to handle
> address folding across GEPs (r209049 and r209065). Unfortunately there were
> some complications so we reverted it. I believe I have a fix for the issues
> that were causing failures on the buildbots, but I have not sent it yet
> because it also prevents the specific optimization case I had.
>
> I am particularly interested in knowing what benchmarks this helps you
> with. I have some more thoughts on how to deal with some related
> optimization opportunities, I’ll try to write them up for comments today or
> tomorrow.
>
> Attached is a combined patch that includes the original commit, the first
> set of fixes from friday night, and a final fix that allowed me to pass the
> reduction cases sent to me.
>
> Louis
>
>
>
>
> > On May 20, 2014, at 6:05 PM, Jingyue Wu <jingyue at google.com>
wrote:
> >
> > Hi,
> >
> > I want to improve the way CGP sinks the incoming values of a PHI node
> towards memory accesses. Improving it means a lot to some of our key
> benchmarks, and I believe can benefit many general cases as well.
> >
> > CGP's OptimizeMemoryInst function handles PHI nodes by running
> AddressingModeMatcher on all incoming values to see whether they reach
> consensus on the addressing mode. It does a reasonably good job in sinking
> a PHI node if its incoming values are used only once (by the PHI node) and
> share the same base register and offset. For example, CGP transforms
> >
> > define void @foo(i1 %cond, float* %a) {
> > entry:
> >   br i1 %cond, label %if.then, label %if.else
> >
> > if.then:
> >   %p1 = getelementptr inbounds float* %a, i64 4
> >   br label %merge
> >
> > if.else:
> >   %q1 = getelementptr inbounds float* %a, i64 4
> >   br label %merge
> >
> > merge:
> >   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
> >   %w1 = load float* %r1
> >   ret void
> > }
> >
> > to
> >
> > define void @foo(i1 %cond, float* %a) {
> > entry:
> >   br i1 %cond, label %if.then, label %if.else
> >
> > if.then:                                          ; preds = %entry
> >   br label %merge
> >
> > if.else:                                          ; preds = %entry
> >   br label %merge
> >
> > merge:                                            ; preds = %if.else,
> %if.then
> >   %0 = bitcast float* %a to i8*
> >   %sunkaddr = getelementptr i8* %0, i64 16
> >   %1 = bitcast i8* %sunkaddr to float*
> >   %w1 = load float* %1
> >   ret void
> > }
> >
> > saving registers on %p1 and %q1.
> >
> > However, this approach misses optimization opportunities in two cases.
> First, if an incoming value of a PHI node is used in multiple memory
> instructions, CGP may not be able to sink it.
> >
> > For example, if we slightly complicate the above example by adding a
> load after %p1,
> >
> > define void @foo(i1 %cond, float* %a) {
> > entry:
> >   br i1 %cond, label %if.then, label %if.else
> >
> > if.then:
> >   %p1 = getelementptr inbounds float* %a, i64 4
> >   %u1 = load float* %p1
> >   br label %merge
> >
> > if.else:
> >   %q1 = getelementptr inbounds float* %a, i64 4
> >   br label %merge
> >
> > merge:
> >   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
> >   %w1 = load float* %r1
> >   ret void
> > }
> >
> > CGP is unable to sink %p1 any more, because, given %p1 is used
multiple
> times, CGP checks whether %p1 can be folded into every memory uses (in this
> case, %u1 = load %p1 and %w1 = load %r1) with the same addressing mode. It
> does that by dry-running MatchOperationAddr on both loads, however
> MatchOperationAddr bails out on PHI nodes.
> >
> > The second complication is the incoming values of a PHI node may have
> different base registers. For example, we slightly modify the original
> example by changing %q1's base to %b.
> >
> > define void @foo(i1 %cond, float* %a, float* %b) {
> > entry:
> >   br i1 %cond, label %if.then, label %if.else
> >
> > if.then:
> >   %p1 = getelementptr inbounds float* %a, i64 4
> >   br label %merge
> >
> > if.else:
> >   %q1 = getelementptr inbounds float* %b, i64 4
> >   br label %merge
> >
> > merge:
> >   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
> >   %w1 = load float* %r1
> >   ret void
> > }
> >
> > Ideally we still want to fold %p1 and %q1 into the load by introducing
a
> new base phi'ing %a and %b, such as
> >
> > %r = phi float* [ %a, %if.then ], [ %b, %if.else ]
> > %r1 = getelementptr inbounds float* %r, i64 4
> > %w1 = load float* %r1
> >
> > saving registers on %p1 and %q1.
> >
> > However, the consensus CGP requires all incoming values to reach
> includes not only the offset but also the base register. (a + 4) and (b +
> 4) unfortunately do not satisfy this condition. Therefore, CGP won't
sink
> %p1 and %q1 toward BB %merge.
> >
> > Our key benchmarks have both cases I described above. A simplified
code
> looks like:
> >
> > define void @foo(i1 %cond, float* %a, float* %b) {
> > entry:
> >   br i1 %cond, label %if.then, label %if.else
> >
> > if.then:
> >   %p1 = getelementptr inbounds float* %a, i64 4
> >   %u1 = load float* %p1
> >   br label %merge
> >
> > if.else:
> >   %q1 = getelementptr inbounds float* %b, i64 4
> >   br label %merge
> >
> > merge:
> >   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
> >   %w1 = load float* %r1
> >   ret void
> > }
> >
> > so we need a solution to attack both issues. While we can solve the
> second issue by introducing a new base register leveraging SSAUpdater, I am
> not sure how to fold this complicated logic in MatchOperationAddr which
> updates one single target addressing mode. One possibility is to extend
> ExtAddrMode to support multiple base registers, which may complicate a lot
> of things.
> >
> > While I am thinking about an appropriate solution, I would like to
bring
> these issues up to get more feedback. I really appreciate your thoughts on
> this.
> >
> > Thanks!
> > Jingyue
> > _______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140521/3d637786/attachment.html>

Possibly Parallel Threads

Search for more maybe matching threads

llvm dev - May 2014 - [LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node

[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node

[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node

[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node

[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node

[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node

[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node

Possibly Parallel Threads