thr3ads.net - llvm dev - [LLVMdev] Question about shouldMergeGEPs in InstructionCombining [Feb 2015]

If this information is useful, please help other people find it:
Share via:

Francois Pichet

2015-Feb-24 12:17 UTC

[LLVMdev] Question about shouldMergeGEPs in InstructionCombining

On Mon, Feb 23, 2015 at 2:17 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
> > From: "Francois Pichet" <pichet2000 at gmail.com>
> > To: "LLVM Developers Mailing List" <llvmdev at
cs.uiuc.edu>
> > Sent: Sunday, February 22, 2015 5:34:11 PM
> > Subject: [LLVMdev] Question about shouldMergeGEPs in
InstructionCombining
> >
> > Hello
> >
> > I am not sure I understand the logic for merging GEPs in
> > InstructionCombining.cpp:
> >
> > static bool shouldMergeGEPs (GEPOperator &GEP, GEPOperator
&Src) {
> > // If this GEP has only 0 indices, it is the same pointer as
> > // Src. If Src is not a trivial GEP too, don't combine
> > // the indices.
> > if (GEP.hasAllZeroIndices() && !Src.hasAllZeroIndices()
&&
> > !Src.hasOneUse())
> > return false;
> > return true;
> > }
> >
> >
> >
> > I have a case where
> > GEP: %arrayidx7 = getelementptr inbounds i32* %arrayidx, i32 %shl6
> > Src: %arrayidx = getelementptr inbounds [4096 x i32]* @phasor_4096,
> > i32 0, i32 %shl2
> >
> >
> > GEP. hasAllZeroIndices() will return false and the merge will occur
> > Why do we want to combine these 2 getelementptr?
> >
> >
> > On my out of tree target, combining these 2 GetElementPtr create a
> > performance regression because since GEP is in a loop (Src is out of
> > loop), GEP will lower to a more complicated address for a subsequent
> > load. (the complicated address needs to be calculated over and over
> > in the loop)
> >
>
> There are a couple of issues here. One, InstCombine's job is the move
the
> IR toward a reasonable canonical form. It is doing that here, and I think
> that's the right thing to do. However, the problem you point out is a
> general one. Can you please explain why the MachineLICM pass is not able to
> hoist the redundant parts of the addressing computation out of the loop? We
> might also want to adjust for this in CodeGenPrep (CGP already has
> addressing-mode aware GEP splitting logic, although not quite for this
> case).
>
>Hi Hal,

MachineLICM is not able to hoist anything because the address mode is not
loop invariant.

Here is a reduction of the code I am talking about.

extern const unsigned   phasor[4096];
void test(unsigned* out , unsigned step_size)
{
  unsigned big_step_size = step_size<<2;
  int *phasor_ptr_temp_1 = &phasor[big_step_size];
  for (int i = 0 ; i < 1020 ; i+=4)
  out[i] = phasor_ptr_temp_1[i<<step_size];
}

I am getting slightly better code on my target (Octasic's Opus) if I return
false for shouldMergeGEPs.
I just tried with ppc64 and x86-64 and I am also getting better code
without the GEP merging in InstructionCombining. I am not sure what the
solution is yet but I think we are too aggressive when merging GEPs in
InstructionCombining.

Here is the details for ppc64: http://pastie.org/9978022
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150224/606b7b39/attachment.html>

Hal Finkel

2015-Feb-25 05:27 UTC

head link

[LLVMdev] Question about shouldMergeGEPs in InstructionCombining

w----- Original Message -----> From: "Francois Pichet" <pichet2000 at gmail.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Tuesday, February 24, 2015 6:17:22 AM
> Subject: Re: [LLVMdev] Question about shouldMergeGEPs in
InstructionCombining
> 
> On Mon, Feb 23, 2015 at 2:17 PM, Hal Finkel < hfinkel at anl.gov >
> wrote:
> 
> 
> ----- Original Message -----
> > From: "Francois Pichet" < pichet2000 at gmail.com >
> > To: "LLVM Developers Mailing List" < llvmdev at
cs.uiuc.edu >
> > Sent: Sunday, February 22, 2015 5:34:11 PM
> > Subject: [LLVMdev] Question about shouldMergeGEPs in
> > InstructionCombining
> > 
> > Hello
> > 
> > I am not sure I understand the logic for merging GEPs in
> > InstructionCombining.cpp:
> > 
> > static bool shouldMergeGEPs (GEPOperator &GEP, GEPOperator
&Src) {
> > // If this GEP has only 0 indices, it is the same pointer as
> > // Src. If Src is not a trivial GEP too, don't combine
> > // the indices.
> > if (GEP.hasAllZeroIndices() && !Src.hasAllZeroIndices()
&&
> > !Src.hasOneUse())
> > return false;
> > return true;
> > }
> > 
> > 
> > 
> > I have a case where
> > GEP: %arrayidx7 = getelementptr inbounds i32* %arrayidx, i32 %shl6
> > Src: %arrayidx = getelementptr inbounds [4096 x i32]* @phasor_4096,
> > i32 0, i32 %shl2
> > 
> > 
> > GEP. hasAllZeroIndices() will return false and the merge will occur
> > Why do we want to combine these 2 getelementptr?
> > 
> > 
> > On my out of tree target, combining these 2 GetElementPtr create a
> > performance regression because since GEP is in a loop (Src is out
> > of
> > loop), GEP will lower to a more complicated address for a
> > subsequent
> > load. (the complicated address needs to be calculated over and over
> > in the loop)
> > 
> 
> There are a couple of issues here. One, InstCombine's job is the move
> the IR toward a reasonable canonical form. It is doing that here,
> and I think that's the right thing to do. However, the problem you
> point out is a general one. Can you please explain why the
> MachineLICM pass is not able to hoist the redundant parts of the
> addressing computation out of the loop? We might also want to adjust
> for this in CodeGenPrep (CGP already has addressing-mode aware GEP
> splitting logic, although not quite for this case).
> 
> 
> 
> 
> Hi Hal,
> 
> 
> MachineLICM is not able to hoist anything because the address mode is
> not loop invariant.
> 
> 
> Here is a reduction of the code I am talking about.
> 
> 
> 
> extern const unsigned phasor[4096];
> void test(unsigned* out , unsigned step_size)
> {
> unsigned big_step_size = step_size<<2;
> int *phasor_ptr_temp_1 = &phasor[big_step_size];
> for (int i = 0 ; i < 1020 ; i+=4)
> out[i] = phasor_ptr_temp_1[i<<step_size];
> }
> 
> 
> I am getting slightly better code on my target (Octasic's Opus) if I
> return false for shouldMergeGEPs.
> I just tried with ppc64 and x86-64 and I am also getting better code
> without the GEP merging in InstructionCombining. I am not sure what
> the solution is yet but I think we are too aggressive when merging
> GEPs in InstructionCombining.
> 
> 
> Here is the details for ppc64: http://pastie.org/9978022
> 
> 
First, thanks for posting this, it is quite useful. LLVM trunk's PowerPC
backend does a slightly better job now, but for an irrelevant reason, so to
stick with the code you generated, we have:

.LBB0_1:                                # %for.body
                                        # =>This Inner Loop Header: Depth=1
	slw 8, 5, 4
	ld 9, .LC1 at toc@l(7)
	addi 5, 5, 4
	add 8, 8, 6
	extsw 8, 8
	sldi 8, 8, 2
	lwzx 8, 9, 8
	addi 9, 3, 16
	stw 8, 0(3)
	mr 3, 9
	bdnz .LBB0_1

there are two things wrong here, first:

	ld 9, .LC1 at toc@l(7)

this load is loop invariant, and should be hoisted (but was not). But more to
your point, this:

	add 8, 8, 6

which represents this in your C code:

  int *phasor_ptr_temp_1 = &phasor[big_step_size];

has remained inside the loop. As you point out, the programmer even hoisted it
for us, and we sunk it into the loop ourselves. The question is really this: How
should we fix this? Generically, I believe that using fewer GEPs makes a cleaner
canonical form for the IR as it passes through the mid-level optimizers, and we
should split the GEPs to hoist invariant parts out of loops later in the
pipeline (in or near CodeGenPrep, LSR, etc.) where we normally do
addressing-mode-sensitive transformations.

 -Hal 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Hal Finkel

2015-Feb-25 21:41 UTC

head link

[LLVMdev] Question about shouldMergeGEPs in InstructionCombining

----- Original Message -----> From: "Hal Finkel" <hfinkel at anl.gov>
> To: "Francois Pichet" <pichet2000 at gmail.com>
> Cc: "LLVM Developers Mailing List" <llvmdev at
cs.uiuc.edu>, "chandlerc" <chandlerc at gmail.com>
> Sent: Tuesday, February 24, 2015 11:27:43 PM
> Subject: Re: [LLVMdev] Question about shouldMergeGEPs in
InstructionCombining
> 
> w----- Original Message -----
> > From: "Francois Pichet" <pichet2000 at gmail.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: "LLVM Developers Mailing List" <llvmdev at
cs.uiuc.edu>
> > Sent: Tuesday, February 24, 2015 6:17:22 AM
> > Subject: Re: [LLVMdev] Question about shouldMergeGEPs in
> > InstructionCombining
> > 
> > On Mon, Feb 23, 2015 at 2:17 PM, Hal Finkel < hfinkel at anl.gov
>
> > wrote:
> > 
> > 
> > ----- Original Message -----
> > > From: "Francois Pichet" < pichet2000 at gmail.com
>
> > > To: "LLVM Developers Mailing List" < llvmdev at
cs.uiuc.edu >
> > > Sent: Sunday, February 22, 2015 5:34:11 PM
> > > Subject: [LLVMdev] Question about shouldMergeGEPs in
> > > InstructionCombining
> > > 
> > > Hello
> > > 
> > > I am not sure I understand the logic for merging GEPs in
> > > InstructionCombining.cpp:
> > > 
> > > static bool shouldMergeGEPs (GEPOperator &GEP, GEPOperator
&Src)
> > > {
> > > // If this GEP has only 0 indices, it is the same pointer as
> > > // Src. If Src is not a trivial GEP too, don't combine
> > > // the indices.
> > > if (GEP.hasAllZeroIndices() && !Src.hasAllZeroIndices()
&&
> > > !Src.hasOneUse())
> > > return false;
> > > return true;
> > > }
> > > 
> > > 
> > > 
> > > I have a case where
> > > GEP: %arrayidx7 = getelementptr inbounds i32* %arrayidx, i32
> > > %shl6
> > > Src: %arrayidx = getelementptr inbounds [4096 x i32]*
> > > @phasor_4096,
> > > i32 0, i32 %shl2
> > > 
> > > 
> > > GEP. hasAllZeroIndices() will return false and the merge will
> > > occur
> > > Why do we want to combine these 2 getelementptr?
> > > 
> > > 
> > > On my out of tree target, combining these 2 GetElementPtr create
> > > a
> > > performance regression because since GEP is in a loop (Src is out
> > > of
> > > loop), GEP will lower to a more complicated address for a
> > > subsequent
> > > load. (the complicated address needs to be calculated over and
> > > over
> > > in the loop)
> > > 
> > 
> > There are a couple of issues here. One, InstCombine's job is the
> > move
> > the IR toward a reasonable canonical form. It is doing that here,
> > and I think that's the right thing to do. However, the problem you
> > point out is a general one. Can you please explain why the
> > MachineLICM pass is not able to hoist the redundant parts of the
> > addressing computation out of the loop? We might also want to
> > adjust
> > for this in CodeGenPrep (CGP already has addressing-mode aware GEP
> > splitting logic, although not quite for this case).
> > 
> > 
> > 
> > 
> > Hi Hal,
> > 
> > 
> > MachineLICM is not able to hoist anything because the address mode
> > is
> > not loop invariant.
> > 
> > 
> > Here is a reduction of the code I am talking about.
> > 
> > 
> > 
> > extern const unsigned phasor[4096];
> > void test(unsigned* out , unsigned step_size)
> > {
> > unsigned big_step_size = step_size<<2;
> > int *phasor_ptr_temp_1 = &phasor[big_step_size];
> > for (int i = 0 ; i < 1020 ; i+=4)
> > out[i] = phasor_ptr_temp_1[i<<step_size];
> > }
> > 
> > 
> > I am getting slightly better code on my target (Octasic's Opus) if
> > I
> > return false for shouldMergeGEPs.
> > I just tried with ppc64 and x86-64 and I am also getting better
> > code
> > without the GEP merging in InstructionCombining. I am not sure what
> > the solution is yet but I think we are too aggressive when merging
> > GEPs in InstructionCombining.
> > 
> > 
> > Here is the details for ppc64: http://pastie.org/9978022
> > 
> > 
> 
> First, thanks for posting this, it is quite useful. LLVM trunk's
> PowerPC backend does a slightly better job now, but for an
> irrelevant reason, so to stick with the code you generated, we have:
> 
> .LBB0_1:                                # %for.body
>                                         # =>This Inner Loop Header:
>                                         Depth=1
> 	slw 8, 5, 4
> 	ld 9, .LC1 at toc@l(7)
> 	addi 5, 5, 4
> 	add 8, 8, 6
> 	extsw 8, 8
> 	sldi 8, 8, 2
> 	lwzx 8, 9, 8
> 	addi 9, 3, 16
> 	stw 8, 0(3)
> 	mr 3, 9
> 	bdnz .LBB0_1
> 
> there are two things wrong here, first:
> 
> 	ld 9, .LC1 at toc@l(7)
> 
> this load is loop invariant, and should be hoisted (but was not).
The problem of the non-hoisted TOC load was a backend deficiency, now fixed in
r230553. Thanks for providing this example. I'll now give some more thought
to the GEP splitting issue.

 -Hal
> But
> more to your point, this:
> 
> 	add 8, 8, 6
> 
> which represents this in your C code:
> 
>   int *phasor_ptr_temp_1 = &phasor[big_step_size];
> 
> has remained inside the loop. As you point out, the programmer even
> hoisted it for us, and we sunk it into the loop ourselves. The
> question is really this: How should we fix this? Generically, I
> believe that using fewer GEPs makes a cleaner canonical form for the
> IR as it passes through the mid-level optimizers, and we should
> split the GEPs to hoist invariant parts out of loops later in the
> pipeline (in or near CodeGenPrep, LSR, etc.) where we normally do
> addressing-mode-sensitive transformations.
> 
>  -Hal
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Seemingly Similar Threads

Search for more possibly parallel threads

llvm dev - Feb 2015 - [LLVMdev] Question about shouldMergeGEPs in InstructionCombining

[LLVMdev] Question about shouldMergeGEPs in InstructionCombining

[LLVMdev] Question about shouldMergeGEPs in InstructionCombining

[LLVMdev] Question about shouldMergeGEPs in InstructionCombining

Seemingly Similar Threads