Sanjoy Das via llvm-dev
2015-Aug-17 02:43 UTC
[llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution
> I don't understand why you want to factor out the information,
> exactly.  It seems like what you need is a function like:
>
>   unsigned getMinLeadingZeros(const SCEV *);
>
> then, if you want to get the non-extended expression, you can just
> apply an appropriate truncation.  I assume, however, that I'm missing
> something.

The problem is not about how to codegen a specific SCEV expression
efficiently, but about the overall solution LSR converges to (which
includes considering how it will codegen *other* SCEVs).  In the
latter sense there is a difference between `X` and `(trunc (zext X))`
even though both of them produce the same value (and possibly are
equally efficient).

The example in [1] is roughly like this at a high level:

You have an i32 induction variable `V` == `{VStart,+,1}`.
`{VStart,+,1}` does not unsigned-overflow given the loop bounds.  Two
interesting uses in the loop are something like (in this mail, by zext
I'll always mean zext from i32 to i64) `GEP @Global, zext(V)` and
`V ult limit`.

Because SCEV can prove that `{VStart,+,1}` does not unsigned-overflow,
`GEP @Global, zext(V)` can be computed as
`GEP (@Global + zext VStart), {i64 0,+,1}`.  Similarly,

  `V` == `trunc (zext V)`
      == `trunc {zext VStart,+,1}`
      == `trunc({i64 0,+,1} + zext VStart)`
      == `trunc({i64 0,+,1}) + trunc (zext VStart)`
      == `trunc({i64 0,+,1}) + VStart`

This is the solution LSR ends up choosing.  However, the cheaper
solution here is

  `GEP @Global, zext(V)` == `GEP @Global, zext({VStart,+,1})`
and
  `V ult limit` == `{VStart,+,1} ult limit`

(i.e. pretty much the direct naive lowering).

This naive lowering is better because we can use the increment in
`{VStart,+,1}` to zero extend the value implicitly on x86 (you can
find the generated machine code at [1]).

The reason LSR does not pick this cheaper solution is that it
"forgets" that `{zext VStart,+,1}` == `zext({VStart,+,1})`.  I'm
trying to fix this by adding a rule that replaces `{zext X,+,zext Y}`
with `zext({X,+,Y})` when possible on architectures that have "free"
zero extensions, so that if there is a solution involving the narrower
induction variable, it will be found.

Now, I could use the trick you mentioned in the place in LSR where it
computes the cost of a solution (i.e. when looking for "common"
registers, see if one register is the zero extension of another).  I
have not tried it, but I suspect it will be fairly tricky -- a given
solution will now have more than one "cost" depending on how you
interpret specific registers (as truncations of some extended value or
as native wide values), and you'd have to search over the possible
interpretations of a specific solution.

> Adding simplification barriers into SCEV expressions seems like a
> dangerous idea, especially since LSR is not the only user of SCEV.

So I don't intend for SCEV to produce SCEVOpaqueExpr nodes itself.
There would instead be a 'const SCEV *getOpaqueExpr(const SCEV *)'
function that clients of SCEV can use to create an opaque expression.
Then most of the simplification routines in SCEV will be changed to
have the moral equivalent of 'if (isa<SCEVOpaqueExpr>(E)) continue;'.

-- Sanjoy

[1]: http://lists.llvm.org/pipermail/llvm-dev/2015-April/084434.html
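For concreteness, the loop shape discussed above corresponds roughly
to the following C++; this is a minimal sketch, and the function,
array, and variable names are illustrative assumptions rather than the
original reproducer from [1]:

  // An i32 induction variable `i` (the {VStart,+,1} above) with two uses:
  // the address computation `GEP @Global, zext(i)` (the cast widens the
  // index from i32 to i64 on an LP64 target) and the exit test `i ult limit`.
  #include <cstdint>

  extern int32_t Global[];   // plays the role of @Global

  int64_t sum(uint32_t VStart, uint32_t limit) {
    int64_t total = 0;
    for (uint32_t i = VStart; i < limit; ++i)   // `i ult limit`
      total += Global[(uint64_t)i];             // `GEP @Global, zext(i)`
    return total;
  }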
Hal Finkel via llvm-dev
2015-Aug-17 04:10 UTC
[llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution
----- Original Message -----
> From: "Sanjoy Das" <sanjoy at playingwithpointers.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Andrew Trick" <atrick at apple.com>, "Quentin Colombet"
> <qcolombet at apple.com>, "Dan Gohman" <dan433584 at gmail.com>
> Sent: Sunday, August 16, 2015 9:43:54 PM
> Subject: Re: [llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution
>
> [...]
>
> The reason LSR does not pick this cheaper solution is that it
> "forgets" that `{zext VStart,+,1}` == `zext({VStart,+,1})`.  I'm
> trying to fix this by adding a rule that replaces `{zext X,+,zext Y}`
> with `zext({X,+,Y})` when possible on architectures that have "free"
> zero extensions, so that if there is a solution involving the
> narrower induction variable, it will be found.

To back up for a second, how much of this is self-inflicted damage?
IndVarSimplify likes to preemptively widen induction variables.  Is
that why you have the extensions here in the first place?

The cost model used by IndVarSimplify is a bit simple:

  // We should not widen an indvar if arithmetics on the wider indvar are more
  // expensive than those on the narrower indvar. We check only the cost of ADD
  // because at least an ADD is required to increment the induction variable. We
  // could compute more comprehensively the cost of all instructions on the
  // induction variable when necessary.
  if (TTI &&
      TTI->getArithmeticInstrCost(Instruction::Add, Ty) >
          TTI->getArithmeticInstrCost(Instruction::Add,
                                      Cast->getOperand(0)->getType())) {

and perhaps we need a bit more sophistication here.
I've found that, generally speaking, the extensions on induction
variables, and SCEV's preference for factoring them out to varying
degrees, end up making SCEV's job more difficult.

Also, Clang likes to generate code that ends up looking like this
(after some optimization):

  %idxprom = sext i32 %i.0 to i64
  %arrayidx = getelementptr inbounds i32, i32* %b, i64 %idxprom

but even here, this is actually unnecessary.  An inbounds GEP is
defined to use sign-extended arithmetic, and so this could just as
easily be:

  %arrayidx = getelementptr inbounds i32, i32* %b, i32 %i.0

In short, I wonder if we're solving the right problem.  Maybe the
canonicalization of the IR is not as useful as it could be.

 -Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
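For reference, this is roughly the kind of C++ that produces the
sext-plus-GEP pattern Hal quotes; the function and variable names are
illustrative assumptions, not taken from the thread:

  // On an LP64 target, the signed 32-bit index is widened with
  // `sext i32 %i to i64` before the inbounds getelementptr, which is
  // the pattern shown above.  (Illustrative sketch only.)
  void scale(int *b, int n) {
    for (int i = 0; i < n; ++i)
      b[i] *= 2;
  }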
Sanjoy Das via llvm-dev
2015-Aug-17 21:38 UTC
[llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution
> To back up for a second, how much of this is self-inflicted damage?
> IndVarSimplify likes to preemptively widen induction variables.  Is
> that why you have the extensions here in the first place?

In the specific example I was talking about, the zext came from our
frontend (our FE used to insert these extensions for reasons that are
no longer relevant).  But you can easily get the same behavior from an
expression like `a[(unsigned long)i]`.

This looked like a fairly straightforward LSR problem to me,
especially because SCEV already has all the smarts to infer the
required properties.  The only thing missing is the right framing of
the problem (i.e. framing it in a way such that we can implement a
maintainable solution).

-- Sanjoy
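A minimal sketch of the `a[(unsigned long)i]` case mentioned above
(the function name and the stored value are illustrative assumptions):
the explicit cast forces the i32 induction variable to be
zero-extended to i64 for the address computation, reproducing the same
pattern without any frontend involvement.

  // The cast yields `GEP %a, zext(i)` on an LP64 target, so the widened
  // index appears even though the induction variable itself is i32.
  void fill(int *a, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
      a[(unsigned long)i] = 0;
  }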
Steve King via llvm-dev
2015-Aug-18 00:35 UTC
[llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution
On Sun, Aug 16, 2015 at 7:43 PM, Sanjoy Das via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> The problem is not about how to codegen a specific SCEV expression
> efficiently, but about the overall solution LSR converges to (which
> includes considering how it will codegen *other* SCEVs).
>
> [...]
>
> So I don't intend for SCEV to produce SCEVOpaqueExpr nodes itself.
> There would instead be a 'const SCEV *getOpaqueExpr(const SCEV *)'
> function that clients of SCEV can use to create an opaque expression.
> Then most of the simplification routines in SCEV will be changed to
> have the moral equivalent of 'if (isa<SCEVOpaqueExpr>(E)) continue;'.

Hi Sanjoy - I also noticed LSR was converging on suboptimal solutions.
I'm hoping you can clarify this discussion for me.

Given a simple copy loop:

  void copy(int* src, int* dst, unsigned int size) {
    for (unsigned int i = 0; i < size; ++i)
      dst[i] = src[i];
  }

the hot spot for i686 contains 6 instructions with LSR:

  8b 32      mov    (%edx),%esi
  89 31      mov    %esi,(%ecx)
  83 c1 04   add    $0x4,%ecx
  83 c2 04   add    $0x4,%edx
  48         dec    %eax
  75 f3      jne    11 <copy+0x11>

but only 5 instructions without LSR:
  8b 3c b2   mov    (%edx,%esi,4),%edi
  89 3c b1   mov    %edi,(%ecx,%esi,4)
  46         inc    %esi
  39 c6      cmp    %eax,%esi
  75 f5      jne    14 <copy+0x14>

My interpretation of this problem was that LSR starts analysis with an
SCEV that counts down toward zero even in an upward counting loop.  A
more natural fit would be an initial IV going from 0 to some limit,
but there does not seem to be a way to express this sort of SCEV.  The
initial condition puts an artificially high cost on solutions with
upward counting registers since none match the "free" down counting
SCEV.  Notice the weird 'dec %eax' showing up as an artifact.  The
best solution with base-index scale-4 addressing was considered by
LSR, but incorrectly deemed to be too expensive.

Is this the same LSR problem you describe?

Thanks,
-steve
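For illustration, an assumed C++ rendering of the two code shapes in
the listings above (not taken from the thread): the version without
LSR keeps the upward-counting index with scaled addressing, while the
LSR output strength-reduces to two advancing pointers plus a separate
down-counting trip counter (the `dec %eax`).

  // Roughly what the 5-instruction (no-LSR) loop body does.
  void copy_no_lsr(int *src, int *dst, unsigned size) {
    for (unsigned i = 0; i != size; ++i)   // inc %esi / cmp %eax,%esi
      dst[i] = src[i];                     // scaled addressing (%edx,%esi,4)
  }

  // Roughly what the 6-instruction (LSR) loop body does.
  void copy_lsr(int *src, int *dst, unsigned size) {
    for (unsigned n = size; n != 0; --n) { // dec %eax / jne
      *dst++ = *src++;                     // two `add $0x4` pointer bumps
    }
  }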
Sanjoy Das via llvm-dev
2015-Aug-18 05:39 UTC
[llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution
> My interpretation of this problem was that LSR starts analysis with an
> SCEV that counts down toward zero even in an upward counting loop.  A
> more natural fit would be an initial IV going from 0 to some limit,
> but there does not seem to be a way to express this sort of SCEV.  The
> initial condition puts an artificially high cost on solutions with
> upward counting registers since none match the "free" down counting
> SCEV.  Notice the weird 'dec %eax' showing up as an artifact.  The
> best solution with base-index scale-4 addressing was considered by
> LSR, but incorrectly deemed to be too expensive.
>
> Is this the same LSR problem you describe?

No, these two issues look unrelated.

-- Sanjoy