Francois Pichet
2015-Feb-24 12:17 UTC
[LLVMdev] Question about shouldMergeGEPs in InstructionCombining
On Mon, Feb 23, 2015 at 2:17 PM, Hal Finkel <hfinkel at anl.gov> wrote:> ----- Original Message ----- > > From: "Francois Pichet" <pichet2000 at gmail.com> > > To: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu> > > Sent: Sunday, February 22, 2015 5:34:11 PM > > Subject: [LLVMdev] Question about shouldMergeGEPs in InstructionCombining > > > > Hello > > > > I am not sure I understand the logic for merging GEPs in > > InstructionCombining.cpp: > > > > static bool shouldMergeGEPs (GEPOperator &GEP, GEPOperator &Src) { > > // If this GEP has only 0 indices, it is the same pointer as > > // Src. If Src is not a trivial GEP too, don't combine > > // the indices. > > if (GEP.hasAllZeroIndices() && !Src.hasAllZeroIndices() && > > !Src.hasOneUse()) > > return false; > > return true; > > } > > > > > > > > I have a case where > > GEP: %arrayidx7 = getelementptr inbounds i32* %arrayidx, i32 %shl6 > > Src: %arrayidx = getelementptr inbounds [4096 x i32]* @phasor_4096, > > i32 0, i32 %shl2 > > > > > > GEP. hasAllZeroIndices() will return false and the merge will occur > > Why do we want to combine these 2 getelementptr? > > > > > > On my out of tree target, combining these 2 GetElementPtr create a > > performance regression because since GEP is in a loop (Src is out of > > loop), GEP will lower to a more complicated address for a subsequent > > load. (the complicated address needs to be calculated over and over > > in the loop) > > > > There are a couple of issues here. One, InstCombine's job is the move the > IR toward a reasonable canonical form. It is doing that here, and I think > that's the right thing to do. However, the problem you point out is a > general one. Can you please explain why the MachineLICM pass is not able to > hoist the redundant parts of the addressing computation out of the loop? We > might also want to adjust for this in CodeGenPrep (CGP already has > addressing-mode aware GEP splitting logic, although not quite for this > case). > >Hi Hal, MachineLICM is not able to hoist anything because the address mode is not loop invariant. Here is a reduction of the code I am talking about. extern const unsigned phasor[4096]; void test(unsigned* out , unsigned step_size) { unsigned big_step_size = step_size<<2; int *phasor_ptr_temp_1 = &phasor[big_step_size]; for (int i = 0 ; i < 1020 ; i+=4) out[i] = phasor_ptr_temp_1[i<<step_size]; } I am getting slightly better code on my target (Octasic's Opus) if I return false for shouldMergeGEPs. I just tried with ppc64 and x86-64 and I am also getting better code without the GEP merging in InstructionCombining. I am not sure what the solution is yet but I think we are too aggressive when merging GEPs in InstructionCombining. Here is the details for ppc64: http://pastie.org/9978022 -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150224/606b7b39/attachment.html>
Hal Finkel
2015-Feb-25 05:27 UTC
[LLVMdev] Question about shouldMergeGEPs in InstructionCombining
w----- Original Message -----> From: "Francois Pichet" <pichet2000 at gmail.com> > To: "Hal Finkel" <hfinkel at anl.gov> > Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu> > Sent: Tuesday, February 24, 2015 6:17:22 AM > Subject: Re: [LLVMdev] Question about shouldMergeGEPs in InstructionCombining > > On Mon, Feb 23, 2015 at 2:17 PM, Hal Finkel < hfinkel at anl.gov > > wrote: > > > ----- Original Message ----- > > From: "Francois Pichet" < pichet2000 at gmail.com > > > To: "LLVM Developers Mailing List" < llvmdev at cs.uiuc.edu > > > Sent: Sunday, February 22, 2015 5:34:11 PM > > Subject: [LLVMdev] Question about shouldMergeGEPs in > > InstructionCombining > > > > Hello > > > > I am not sure I understand the logic for merging GEPs in > > InstructionCombining.cpp: > > > > static bool shouldMergeGEPs (GEPOperator &GEP, GEPOperator &Src) { > > // If this GEP has only 0 indices, it is the same pointer as > > // Src. If Src is not a trivial GEP too, don't combine > > // the indices. > > if (GEP.hasAllZeroIndices() && !Src.hasAllZeroIndices() && > > !Src.hasOneUse()) > > return false; > > return true; > > } > > > > > > > > I have a case where > > GEP: %arrayidx7 = getelementptr inbounds i32* %arrayidx, i32 %shl6 > > Src: %arrayidx = getelementptr inbounds [4096 x i32]* @phasor_4096, > > i32 0, i32 %shl2 > > > > > > GEP. hasAllZeroIndices() will return false and the merge will occur > > Why do we want to combine these 2 getelementptr? > > > > > > On my out of tree target, combining these 2 GetElementPtr create a > > performance regression because since GEP is in a loop (Src is out > > of > > loop), GEP will lower to a more complicated address for a > > subsequent > > load. (the complicated address needs to be calculated over and over > > in the loop) > > > > There are a couple of issues here. One, InstCombine's job is the move > the IR toward a reasonable canonical form. It is doing that here, > and I think that's the right thing to do. However, the problem you > point out is a general one. Can you please explain why the > MachineLICM pass is not able to hoist the redundant parts of the > addressing computation out of the loop? We might also want to adjust > for this in CodeGenPrep (CGP already has addressing-mode aware GEP > splitting logic, although not quite for this case). > > > > > Hi Hal, > > > MachineLICM is not able to hoist anything because the address mode is > not loop invariant. > > > Here is a reduction of the code I am talking about. > > > > extern const unsigned phasor[4096]; > void test(unsigned* out , unsigned step_size) > { > unsigned big_step_size = step_size<<2; > int *phasor_ptr_temp_1 = &phasor[big_step_size]; > for (int i = 0 ; i < 1020 ; i+=4) > out[i] = phasor_ptr_temp_1[i<<step_size]; > } > > > I am getting slightly better code on my target (Octasic's Opus) if I > return false for shouldMergeGEPs. > I just tried with ppc64 and x86-64 and I am also getting better code > without the GEP merging in InstructionCombining. I am not sure what > the solution is yet but I think we are too aggressive when merging > GEPs in InstructionCombining. > > > Here is the details for ppc64: http://pastie.org/9978022 > >First, thanks for posting this, it is quite useful. LLVM trunk's PowerPC backend does a slightly better job now, but for an irrelevant reason, so to stick with the code you generated, we have: .LBB0_1: # %for.body # =>This Inner Loop Header: Depth=1 slw 8, 5, 4 ld 9, .LC1 at toc@l(7) addi 5, 5, 4 add 8, 8, 6 extsw 8, 8 sldi 8, 8, 2 lwzx 8, 9, 8 addi 9, 3, 16 stw 8, 0(3) mr 3, 9 bdnz .LBB0_1 there are two things wrong here, first: ld 9, .LC1 at toc@l(7) this load is loop invariant, and should be hoisted (but was not). But more to your point, this: add 8, 8, 6 which represents this in your C code: int *phasor_ptr_temp_1 = &phasor[big_step_size]; has remained inside the loop. As you point out, the programmer even hoisted it for us, and we sunk it into the loop ourselves. The question is really this: How should we fix this? Generically, I believe that using fewer GEPs makes a cleaner canonical form for the IR as it passes through the mid-level optimizers, and we should split the GEPs to hoist invariant parts out of loops later in the pipeline (in or near CodeGenPrep, LSR, etc.) where we normally do addressing-mode-sensitive transformations. -Hal -- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory
Hal Finkel
2015-Feb-25 21:41 UTC
[LLVMdev] Question about shouldMergeGEPs in InstructionCombining
----- Original Message -----> From: "Hal Finkel" <hfinkel at anl.gov> > To: "Francois Pichet" <pichet2000 at gmail.com> > Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>, "chandlerc" <chandlerc at gmail.com> > Sent: Tuesday, February 24, 2015 11:27:43 PM > Subject: Re: [LLVMdev] Question about shouldMergeGEPs in InstructionCombining > > w----- Original Message ----- > > From: "Francois Pichet" <pichet2000 at gmail.com> > > To: "Hal Finkel" <hfinkel at anl.gov> > > Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu> > > Sent: Tuesday, February 24, 2015 6:17:22 AM > > Subject: Re: [LLVMdev] Question about shouldMergeGEPs in > > InstructionCombining > > > > On Mon, Feb 23, 2015 at 2:17 PM, Hal Finkel < hfinkel at anl.gov > > > wrote: > > > > > > ----- Original Message ----- > > > From: "Francois Pichet" < pichet2000 at gmail.com > > > > To: "LLVM Developers Mailing List" < llvmdev at cs.uiuc.edu > > > > Sent: Sunday, February 22, 2015 5:34:11 PM > > > Subject: [LLVMdev] Question about shouldMergeGEPs in > > > InstructionCombining > > > > > > Hello > > > > > > I am not sure I understand the logic for merging GEPs in > > > InstructionCombining.cpp: > > > > > > static bool shouldMergeGEPs (GEPOperator &GEP, GEPOperator &Src) > > > { > > > // If this GEP has only 0 indices, it is the same pointer as > > > // Src. If Src is not a trivial GEP too, don't combine > > > // the indices. > > > if (GEP.hasAllZeroIndices() && !Src.hasAllZeroIndices() && > > > !Src.hasOneUse()) > > > return false; > > > return true; > > > } > > > > > > > > > > > > I have a case where > > > GEP: %arrayidx7 = getelementptr inbounds i32* %arrayidx, i32 > > > %shl6 > > > Src: %arrayidx = getelementptr inbounds [4096 x i32]* > > > @phasor_4096, > > > i32 0, i32 %shl2 > > > > > > > > > GEP. hasAllZeroIndices() will return false and the merge will > > > occur > > > Why do we want to combine these 2 getelementptr? > > > > > > > > > On my out of tree target, combining these 2 GetElementPtr create > > > a > > > performance regression because since GEP is in a loop (Src is out > > > of > > > loop), GEP will lower to a more complicated address for a > > > subsequent > > > load. (the complicated address needs to be calculated over and > > > over > > > in the loop) > > > > > > > There are a couple of issues here. One, InstCombine's job is the > > move > > the IR toward a reasonable canonical form. It is doing that here, > > and I think that's the right thing to do. However, the problem you > > point out is a general one. Can you please explain why the > > MachineLICM pass is not able to hoist the redundant parts of the > > addressing computation out of the loop? We might also want to > > adjust > > for this in CodeGenPrep (CGP already has addressing-mode aware GEP > > splitting logic, although not quite for this case). > > > > > > > > > > Hi Hal, > > > > > > MachineLICM is not able to hoist anything because the address mode > > is > > not loop invariant. > > > > > > Here is a reduction of the code I am talking about. > > > > > > > > extern const unsigned phasor[4096]; > > void test(unsigned* out , unsigned step_size) > > { > > unsigned big_step_size = step_size<<2; > > int *phasor_ptr_temp_1 = &phasor[big_step_size]; > > for (int i = 0 ; i < 1020 ; i+=4) > > out[i] = phasor_ptr_temp_1[i<<step_size]; > > } > > > > > > I am getting slightly better code on my target (Octasic's Opus) if > > I > > return false for shouldMergeGEPs. > > I just tried with ppc64 and x86-64 and I am also getting better > > code > > without the GEP merging in InstructionCombining. I am not sure what > > the solution is yet but I think we are too aggressive when merging > > GEPs in InstructionCombining. > > > > > > Here is the details for ppc64: http://pastie.org/9978022 > > > > > > First, thanks for posting this, it is quite useful. LLVM trunk's > PowerPC backend does a slightly better job now, but for an > irrelevant reason, so to stick with the code you generated, we have: > > .LBB0_1: # %for.body > # =>This Inner Loop Header: > Depth=1 > slw 8, 5, 4 > ld 9, .LC1 at toc@l(7) > addi 5, 5, 4 > add 8, 8, 6 > extsw 8, 8 > sldi 8, 8, 2 > lwzx 8, 9, 8 > addi 9, 3, 16 > stw 8, 0(3) > mr 3, 9 > bdnz .LBB0_1 > > there are two things wrong here, first: > > ld 9, .LC1 at toc@l(7) > > this load is loop invariant, and should be hoisted (but was not).The problem of the non-hoisted TOC load was a backend deficiency, now fixed in r230553. Thanks for providing this example. I'll now give some more thought to the GEP splitting issue. -Hal> But > more to your point, this: > > add 8, 8, 6 > > which represents this in your C code: > > int *phasor_ptr_temp_1 = &phasor[big_step_size]; > > has remained inside the loop. As you point out, the programmer even > hoisted it for us, and we sunk it into the loop ourselves. The > question is really this: How should we fix this? Generically, I > believe that using fewer GEPs makes a cleaner canonical form for the > IR as it passes through the mid-level optimizers, and we should > split the GEPs to hoist invariant parts out of loops later in the > pipeline (in or near CodeGenPrep, LSR, etc.) where we normally do > addressing-mode-sensitive transformations. > > -Hal > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory