thr3ads.net - llvm dev - [LLVMdev] RFC: Missing canonicalization in LLVM [Apr 2015]

If this information is useful, please help other people find it:
Share via:

Daniel Stewart

2015-Apr-21 16:18 UTC

[LLVMdev] RFC: Missing canonicalization in LLVM

So this change did indeed have an effect! :) 

 

I’m seeing regressions in a number of benchmarks mainly due to a host of extra
bitcasts that get introduced. Here’s the problem I’m seeing in a nutshell:

 

1)      There is a Phi with input type double

2)      Polly demotes the phi into a load/store of type double

3)      InstCombine canonicalizes the load/store to use i64 instead of double

4)      SROA removes the load/store & inserts a phi back in, using i64 as
the type. Inserts bitcast to get to double.

5)      The bitcast sticks around and eventually get translated into FMOVs (for
AArch64 at least).

 

The function findCommonType() in SROA.cpp is used to obtain the type that should
be used for the new alloca that SROA wants to create. It’s decision process is
essentially – if all loads/stores of alloca are the same, use that type; else
use the corresponding integer type. This causes bitcasts to be inserted in a
number of places, most all of which stick around.

 

I’ve copied a reduced version of an instance of the problem below. I’m looking
for comments on what others think is the right solution here. Make SROA more
intelligent about picking the type?

 

The code is below with all unnecessary code removed for easy consumption. 

 

Daniel

Before Polly – Prepare code for polly we have code that looks like:

 

while.cond473:                                    ; preds =
%while.cond473.outer78, %while.body475

  %p_j_x452.0 = phi double [ %105, %while.body475 ], [ %p_j_x452.0.ph82,
%while.cond473.outer78 ]

 

while.body475:                                    ; preds = %while.cond473

  %sub480 = fsub fast double %64, %p_j_x452.0

  %105 = load double* %x485, align 8, !tbaa !25

 

After Polly – Prepare code for polly we have:

 

while.cond473:                                    ; preds =
%while.cond473.outer78, %while.body475

  %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem

 

while.body475:                                    ; preds = %while.cond473

  %sub480 = fsub fast double %64, %p_j_x452.0.reload

  %110 = load double* %x485, align 8, !tbaa !25

  store double %110, double* %p_j_x452.0.reg2mem

 

After Combine redundant instructions :

 

while.cond473:                                    ; preds =
%while.cond473.outer78, %while.body475

  %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem, align 8

 

while.body475:                                    ; preds = %while.cond473

  %sub480 = fsub fast double %74, %p_j_x452.0.reload

  %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482, i32 0,
i32 0

  %194 = bitcast double* %x485 to i64*

  %195 = load i64* %194, align 8, !tbaa !25

  %200 = bitcast double* %p_j_x452.0.reg2mem to i64*

  store i64 %195, i64* %200, align 8

 

After SROA :

 

while.cond473:                                    ; preds =
%while.cond473.outer78, %while.body475

  %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 = phi i64 [
%p_j_x452.0.ph73.reg2mem.sroa.0.0.load368, %while.cond473.outer78 ], [ %178,
%while.body475 ]

  %173 = bitcast i64 %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 to double

 

while.body475:                                    ; preds = %while.cond473

  %sub480 = fsub fast double %78, %173

  %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482, i32 0,
i32 0

  %177 = bitcast double* %x485 to i64*

  %178 = load i64* %177, align 8, !tbaa !25

 

 

From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On
Behalf Of Chandler Carruth
Sent: Wednesday, January 21, 2015 8:32 PM
To: Pete Cooper
Cc: LLVM Developers Mailing List
Subject: Re: [LLVMdev] RFC: Missing canonicalization in LLVM

 

 

On Wed, Jan 21, 2015 at 3:06 PM, Pete Cooper <peter_cooper at apple.com
<mailto:peter_cooper at apple.com> > wrote:

Sounds good to me.  Integers it is then.


FYI, thanks, I'm just going to commit this then. It seems we're all in
essential agreement. We can revert it and take a more cautious approach if
something terrible happens. =]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150421/efc76068/attachment.html>

Pete Cooper

2015-Apr-21 18:00 UTC

head link

[LLVMdev] RFC: Missing canonicalization in LLVM

Hi Daniel

Thanks for the excellent breakdown of whats going on here.

Earlier in the thread on this I made this comment:

"The first thing that springs to mind is that I don’t trust the backend to
get this right.  I don’t think it will understand when an i32 load/store would
have been preferable to a float one or vice versa.  I have no evidence of this,
but given how strongly typed tablegen is, I don’t think it can make a good
choice here.

So I think we probably need to teach the backend how to undo whatever canonical
form we choose if it has a reason to”

Without seeing the machine instructions, its hard to be 100% certain, but the
case you’ve found may be simple enough that the backend can actually fix this.
However, such a fixup would be quite target specific (such a target would need
different register classes for integers and doubles in this case), and we’d need
such a pass for all targets which isn’t ideal.

So i wouldn’t rule out a backend solution, but i have a preference for your
suggestion to improve SROA.

In this particular case, it makes sense for SROA to do effectively the same
analysis InstCombine did here and work out when a load is just raw data vs when
its data is used as a specific type.  The relevant piece of InstCombine is this:

// Try to canonicalize loads which are only ever stored to operate over
  // integers instead of any other type. We only do this when the loaded type
  // is sized and has a size exactly the same as its store size and the store
  // size is a legal integer type.
  if (!Ty->isIntegerTy() && Ty->isSized() &&
      DL.isLegalInteger(DL.getTypeStoreSizeInBits(Ty)) &&
      DL.getTypeStoreSizeInBits(Ty) == DL.getTypeSizeInBits(Ty)) {
    if (std::all_of(LI.user_begin(), LI.user_end(), [&LI](User *U) {
          auto *SI = dyn_cast<StoreInst>(U);
          return SI && SI->getPointerOperand() != &LI;
        })) {
...

After ignoring load/stores which satisfy something like the above code, you can
always fallback to the current code of choosing an integer type, so in the
common case there won’t be any behavior difference.

Cheers
Pete> On Apr 21, 2015, at 9:18 AM, Daniel Stewart <stewartd at
codeaurora.org> wrote:
> 
> So this change did indeed have an effect! J
>  
> I’m seeing regressions in a number of benchmarks mainly due to a host of
extra bitcasts that get introduced. Here’s the problem I’m seeing in a nutshell:
>  
> 1)      There is a Phi with input type double
> 2)      Polly demotes the phi into a load/store of type double
> 3)      InstCombine canonicalizes the load/store to use i64 instead of
double
> 4)      SROA removes the load/store & inserts a phi back in, using i64
as the type. Inserts bitcast to get to double.
> 5)      The bitcast sticks around and eventually get translated into FMOVs
(for AArch64 at least).
>  
> The function findCommonType() in SROA.cpp is used to obtain the type that
should be used for the new alloca that SROA wants to create. It’s decision
process is essentially – if all loads/stores of alloca are the same, use that
type; else use the corresponding integer type. This causes bitcasts to be
inserted in a number of places, most all of which stick around.
>  
> I’ve copied a reduced version of an instance of the problem below. I’m
looking for comments on what others think is the right solution here. Make SROA
more intelligent about picking the type?
>  
> The code is below with all unnecessary code removed for easy consumption. 
>  
> Daniel
> Before Polly – Prepare code for polly we have code that looks like:
>  
> while.cond473:                                    ; preds =
%while.cond473.outer78, %while.body475
>   %p_j_x452.0 = phi double [ %105, %while.body475 ], [ %p_j_x452.0.ph82,
%while.cond473.outer78 ]
>  
> while.body475:                                    ; preds = %while.cond473
>   %sub480 = fsub fast double %64, %p_j_x452.0
>   %105 = load double* %x485, align 8, !tbaa !25
>  
> After Polly – Prepare code for polly we have:
>  
> while.cond473:                                    ; preds =
%while.cond473.outer78, %while.body475
>   %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem
>  
> while.body475:                                    ; preds = %while.cond473
>   %sub480 = fsub fast double %64, %p_j_x452.0.reload
>   %110 = load double* %x485, align 8, !tbaa !25
>   store double %110, double* %p_j_x452.0.reg2mem
>  
> After Combine redundant instructions :
>  
> while.cond473:                                    ; preds =
%while.cond473.outer78, %while.body475
>   %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem, align 8
>  
> while.body475:                                    ; preds = %while.cond473
>   %sub480 = fsub fast double %74, %p_j_x452.0.reload
>   %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482,
i32 0, i32 0
>   %194 = bitcast double* %x485 to i64*
>   %195 = load i64* %194, align 8, !tbaa !25
>   %200 = bitcast double* %p_j_x452.0.reg2mem to i64*
>   store i64 %195, i64* %200, align 8
>  
> After SROA :
>  
> while.cond473:                                    ; preds =
%while.cond473.outer78, %while.body475
>   %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 = phi i64 [
%p_j_x452.0.ph73.reg2mem.sroa.0.0.load368, %while.cond473.outer78 ], [ %178,
%while.body475 ]
>   %173 = bitcast i64 %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 to
double
>  
> while.body475:                                    ; preds = %while.cond473
>   %sub480 = fsub fast double %78, %173
>   %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482,
i32 0, i32 0
>   %177 = bitcast double* %x485 to i64*
>   %178 = load i64* %177, align 8, !tbaa !25
>  
>  
> From: llvmdev-bounces at cs.uiuc.edu <mailto:llvmdev-bounces at
cs.uiuc.edu> [mailto:llvmdev-bounces at cs.uiuc.edu
<mailto:llvmdev-bounces at cs.uiuc.edu>] On Behalf Of Chandler Carruth
> Sent: Wednesday, January 21, 2015 8:32 PM
> To: Pete Cooper
> Cc: LLVM Developers Mailing List
> Subject: Re: [LLVMdev] RFC: Missing canonicalization in LLVM
>  
>  
> On Wed, Jan 21, 2015 at 3:06 PM, Pete Cooper <peter_cooper at apple.com
<mailto:peter_cooper at apple.com>> wrote:
>> Sounds good to me.  Integers it is then.
> 
> 
> FYI, thanks, I'm just going to commit this then. It seems we're all
in essential agreement. We can revert it and take a more cautious approach if
something terrible happens. =]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150421/327165e4/attachment.html>

Daniel Stewart

2015-Apr-23 13:17 UTC

head link

[LLVMdev] RFC: Missing canonicalization in LLVM

Thanks for the reply Pete. Unfortunately, I don’t think it is going to be as
simple as ignoring those loads which only store. In findCommonType(), only one
alloca is passed in at a time. So, while you could find those cases where that
alloca was loaded from and stored elsewhere, you can’t find those places that
store to that alloca from somewhere else (at least not easily that I can see).

 

So in my particular example, I could catch the case of only load -> store.
However, there are other stores that use the alloca address, but it is not
readily apparent if they come directly & only from a load.

 

Still trying to figure out the best way forward.

 

Daniel

 

From: Pete Cooper [mailto:peter_cooper at apple.com] 
Sent: Tuesday, April 21, 2015 2:00 PM
To: Daniel Stewart
Cc: LLVM Developers Mailing List; Chandler Carruth
Subject: Re: [LLVMdev] RFC: Missing canonicalization in LLVM

 

Hi Daniel

 

Thanks for the excellent breakdown of whats going on here.

 

Earlier in the thread on this I made this comment:

 

"The first thing that springs to mind is that I don’t trust the backend to
get this right.  I don’t think it will understand when an i32 load/store would
have been preferable to a float one or vice versa.  I have no evidence of this,
but given how strongly typed tablegen is, I don’t think it can make a good
choice here.

 

So I think we probably need to teach the backend how to undo whatever canonical
form we choose if it has a reason to”

 

Without seeing the machine instructions, its hard to be 100% certain, but the
case you’ve found may be simple enough that the backend can actually fix this.
However, such a fixup would be quite target specific (such a target would need
different register classes for integers and doubles in this case), and we’d need
such a pass for all targets which isn’t ideal.

 

So i wouldn’t rule out a backend solution, but i have a preference for your
suggestion to improve SROA.

 

In this particular case, it makes sense for SROA to do effectively the same
analysis InstCombine did here and work out when a load is just raw data vs when
its data is used as a specific type.  The relevant piece of InstCombine is this:

 

// Try to canonicalize loads which are only ever stored to operate over

  // integers instead of any other type. We only do this when the loaded type

  // is sized and has a size exactly the same as its store size and the store

  // size is a legal integer type.

  if (!Ty->isIntegerTy() && Ty->isSized() &&

      DL.isLegalInteger(DL.getTypeStoreSizeInBits(Ty)) &&

      DL.getTypeStoreSizeInBits(Ty) == DL.getTypeSizeInBits(Ty)) {

    if (std::all_of(LI.user_begin(), LI.user_end(), [&LI](User *U) {

          auto *SI = dyn_cast<StoreInst>(U);

          return SI && SI->getPointerOperand() != &LI;

        })) {

...

 

After ignoring load/stores which satisfy something like the above code, you can
always fallback to the current code of choosing an integer type, so in the
common case there won’t be any behavior difference.

 

Cheers

Pete

On Apr 21, 2015, at 9:18 AM, Daniel Stewart <stewartd at codeaurora.org
<mailto:stewartd at codeaurora.org> > wrote:

 

So this change did indeed have an effect! :)

 

I’m seeing regressions in a number of benchmarks mainly due to a host of extra
bitcasts that get introduced. Here’s the problem I’m seeing in a nutshell:

 

1)      There is a Phi with input type double

2)      Polly demotes the phi into a load/store of type double

3)      InstCombine canonicalizes the load/store to use i64 instead of double

4)      SROA removes the load/store & inserts a phi back in, using i64 as
the type. Inserts bitcast to get to double.

5)      The bitcast sticks around and eventually get translated into FMOVs (for
AArch64 at least).

 

The function findCommonType() in SROA.cpp is used to obtain the type that should
be used for the new alloca that SROA wants to create. It’s decision process is
essentially – if all loads/stores of alloca are the same, use that type; else
use the corresponding integer type. This causes bitcasts to be inserted in a
number of places, most all of which stick around.

 

I’ve copied a reduced version of an instance of the problem below. I’m looking
for comments on what others think is the right solution here. Make SROA more
intelligent about picking the type?

 

The code is below with all unnecessary code removed for easy consumption. 

 

Daniel

Before Polly – Prepare code for polly we have code that looks like:

 

while.cond473:                                    ; preds =
%while.cond473.outer78, %while.body475

  %p_j_x452.0 = phi double [ %105, %while.body475 ], [ %p_j_x452.0.ph82,
%while.cond473.outer78 ]

 

while.body475:                                    ; preds = %while.cond473

  %sub480 = fsub fast double %64, %p_j_x452.0

  %105 = load double* %x485, align 8, !tbaa !25

 

After Polly – Prepare code for polly we have:

 

while.cond473:                                    ; preds =
%while.cond473.outer78, %while.body475

  %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem

 

while.body475:                                    ; preds = %while.cond473

  %sub480 = fsub fast double %64, %p_j_x452.0.reload

  %110 = load double* %x485, align 8, !tbaa !25

  store double %110, double* %p_j_x452.0.reg2mem

 

After Combine redundant instructions :

 

while.cond473:                                    ; preds =
%while.cond473.outer78, %while.body475

  %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem, align 8

 

while.body475:                                    ; preds = %while.cond473

  %sub480 = fsub fast double %74, %p_j_x452.0.reload

  %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482, i32 0,
i32 0

  %194 = bitcast double* %x485 to i64*

  %195 = load i64* %194, align 8, !tbaa !25

  %200 = bitcast double* %p_j_x452.0.reg2mem to i64*

  store i64 %195, i64* %200, align 8

 

After SROA :

 

while.cond473:                                    ; preds =
%while.cond473.outer78, %while.body475

  %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 = phi i64 [
%p_j_x452.0.ph73.reg2mem.sroa.0.0.load368, %while.cond473.outer78 ], [ %178,
%while.body475 ]

  %173 = bitcast i64 %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 to double

 

while.body475:                                    ; preds = %while.cond473

  %sub480 = fsub fast double %78, %173

  %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482, i32 0,
i32 0

  %177 = bitcast double* %x485 to i64*

  %178 = load i64* %177, align 8, !tbaa !25

 

 

From:  <mailto:llvmdev-bounces at cs.uiuc.edu> llvmdev-bounces at
cs.uiuc.edu [ <mailto:llvmdev-bounces at cs.uiuc.edu>
mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Chandler Carruth
Sent: Wednesday, January 21, 2015 8:32 PM
To: Pete Cooper
Cc: LLVM Developers Mailing List
Subject: Re: [LLVMdev] RFC: Missing canonicalization in LLVM

 

 

On Wed, Jan 21, 2015 at 3:06 PM, Pete Cooper < <mailto:peter_cooper at
apple.com> peter_cooper at apple.com> wrote:

Sounds good to me.  Integers it is then.


FYI, thanks, I'm just going to commit this then. It seems we're all in
essential agreement. We can revert it and take a more cautious approach if
something terrible happens. =]

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150423/059fbc7c/attachment.html>

Apparently Analagous Threads

Search for more reasonably related threads

llvm dev - Apr 2015 - [LLVMdev] RFC: Missing canonicalization in LLVM

[LLVMdev] RFC: Missing canonicalization in LLVM

[LLVMdev] RFC: Missing canonicalization in LLVM

[LLVMdev] RFC: Missing canonicalization in LLVM

Apparently Analagous Threads