Sanjoy Das via llvm-dev
2015-Aug-17 02:43 UTC
[llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution
> I don't understand why you want to factor out the information,
> exactly.  It seems like what you need is a function like:
>
>   unsigned getMinLeadingZeros(const SCEV *);
>
> then, if you want to get the non-extended expression, you can just
> apply an appropriate truncation.  I assume, however, that I'm missing
> something.

The problem is not about how to codegen a specific SCEV expression
efficiently, but about the overall solution LSR converges to (which
includes considering how it will codegen *other* SCEVs).  In the
latter sense there is a difference between `X` and `(trunc (zext X))`
even though both of them produce the same value (and possibly are
equally efficient).

The example in [1] is roughly like this at a high level:

You have an i32 induction variable `V` == `{VStart,+,1}`.
`{VStart,+,1}` does not unsigned-overflow given the loop bounds.  Two
interesting uses in the loop are something like (in this mail, by zext
I'll always mean zext from i32 to i64) `GEP @Global, zext(V)` and
`V ult limit`.

Because SCEV can prove that `{VStart,+,1}` does not unsigned-overflow,
`GEP @Global, zext(V)` can be computed as
`GEP (@Global + zext VStart), {i64 0,+,1}`.  Similarly,

  `V` == `trunc (zext V)`
      == `trunc {zext VStart,+,1}`
      == `trunc({i64 0,+,1} + zext VStart)`
      == `trunc({i64 0,+,1}) + trunc (zext VStart)`
      == `trunc({i64 0,+,1}) + VStart`

This is the solution LSR ends up choosing.  However, the cheaper
solution here is

  `GEP @Global, zext(V)` == `GEP @Global, zext({VStart,+,1})`
and
  `V ult limit` == `{VStart,+,1} ult limit`

(i.e. pretty much the direct naive lowering).

This naive lowering is better because we can use the increment in
`{VStart,+,1}` to zero extend the value implicitly on x86 (you can
find the generated machine code at [1]).

The reason LSR does not pick this cheaper solution is that it
"forgets" that `{zext VStart,+,1}` == `zext({VStart,+,1})`.  I'm
trying to fix this by adding a rule that replaces `{zext X,+,zext Y}`
with `zext({X,+,Y})` when possible on architectures that have "free"
zero extensions, so that if there is a solution involving the narrower
induction variable, it will be found.

Now, I could use the trick you mentioned in the place in LSR where it
computes the cost of a solution (i.e. when looking for "common"
registers, see if one register is the zero extension of another).  I
have not tried it, but I suspect it will be fairly tricky -- a given
solution will now have more than one "cost" depending on how you
interpret specific registers (as truncations of some extended value or
as native wide values), and you'd have to search over the possible
interpretations of a specific solution.

> Adding simplification barriers into SCEV expressions seems like a
> dangerous idea, especially since LSR is not the only user of SCEV.

So I don't intend for SCEV to produce SCEVOpaqueExpr nodes itself.
There would instead be a 'const SCEV *getOpaqueExpr(const SCEV *)'
function that clients of SCEV can use to create an opaque expression.
Then most of the simplification routines in SCEV will be changed to
have the moral equivalent of 'if (isa<SCEVOpaqueExpr>(E)) continue;'.

-- Sanjoy

[1]: http://lists.llvm.org/pipermail/llvm-dev/2015-April/084434.html
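For concreteness, the loop shape discussed above corresponds roughly
to the following C++; this is a minimal sketch, and the function,
array, and variable names are illustrative assumptions rather than the
original reproducer from [1]:

  // An i32 induction variable `i` (the {VStart,+,1} above) with two uses:
  // the address computation `GEP @Global, zext(i)` (the cast widens the
  // index from i32 to i64 on an LP64 target) and the exit test `i ult limit`.
  #include <cstdint>

  extern int32_t Global[];   // plays the role of @Global

  int64_t sum(uint32_t VStart, uint32_t limit) {
    int64_t total = 0;
    for (uint32_t i = VStart; i < limit; ++i)   // `i ult limit`
      total += Global[(uint64_t)i];             // `GEP @Global, zext(i)`
    return total;
  }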
Hal Finkel via llvm-dev
2015-Aug-17 04:10 UTC
[llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution
----- Original Message -----
> From: "Sanjoy Das" <sanjoy at playingwithpointers.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Andrew Trick" <atrick at apple.com>, "Quentin Colombet"
> <qcolombet at apple.com>, "Dan Gohman" <dan433584 at gmail.com>
> Sent: Sunday, August 16, 2015 9:43:54 PM
> Subject: Re: [llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution
>
> [...]
>
> The reason LSR does not pick this cheaper solution is that it
> "forgets" that `{zext VStart,+,1}` == `zext({VStart,+,1})`.  I'm
> trying to fix this by adding a rule that replaces `{zext X,+,zext Y}`
> with `zext({X,+,Y})` when possible on architectures that have "free"
> zero extensions, so that if there is a solution involving the
> narrower induction variable, it will be found.

To back up for a second, how much of this is self-inflicted damage?
IndVarSimplify likes to preemptively widen induction variables.  Is
that why you have the extensions here in the first place?

The cost model used by IndVarSimplify is a bit simple:

  // We should not widen an indvar if arithmetics on the wider indvar are more
  // expensive than those on the narrower indvar. We check only the cost of ADD
  // because at least an ADD is required to increment the induction variable. We
  // could compute more comprehensively the cost of all instructions on the
  // induction variable when necessary.
  if (TTI &&
      TTI->getArithmeticInstrCost(Instruction::Add, Ty) >
          TTI->getArithmeticInstrCost(Instruction::Add,
                                      Cast->getOperand(0)->getType())) {

and perhaps we need a bit more sophistication here.
I've found that, generally speaking, the extensions on induction
variables, and SCEV's preference for factoring them out to varying
degrees, end up making SCEV's job more difficult.

Also, Clang likes to generate code that ends up looking like this
(after some optimization):

  %idxprom = sext i32 %i.0 to i64
  %arrayidx = getelementptr inbounds i32, i32* %b, i64 %idxprom

but even here, this is actually unnecessary.  An inbounds GEP is
defined to use sign-extended arithmetic, and so this could just as
easily be:

  %arrayidx = getelementptr inbounds i32, i32* %b, i32 %i.0

In short, I wonder if we're solving the right problem.  Maybe the
canonicalization of the IR is not as useful as it could be.

 -Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
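For reference, this is roughly the kind of C++ that produces the
sext-plus-GEP pattern Hal quotes; the function and variable names are
illustrative assumptions, not taken from the thread:

  // On an LP64 target, the signed 32-bit index is widened with
  // `sext i32 %i to i64` before the inbounds getelementptr, which is
  // the pattern shown above.  (Illustrative sketch only.)
  void scale(int *b, int n) {
    for (int i = 0; i < n; ++i)
      b[i] *= 2;
  }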
Sanjoy Das via llvm-dev
2015-Aug-17 21:38 UTC
[llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution
> To back up for a second, how much of this is self-inflicted damage?
> IndVarSimplify likes to preemptively widen induction variables.  Is
> that why you have the extensions here in the first place?

In the specific example I was talking about, the zext came from our
frontend (our FE used to insert these extensions for reasons that are
no longer relevant).  But you can easily get the same behavior from an
expression like `a[(unsigned long)i]`.

This looked like a fairly straightforward LSR problem to me,
especially because SCEV already has all the smarts to infer the
required properties.  The only thing missing is the right framing of
the problem (i.e. framing it in a way such that we can implement a
maintainable solution).

-- Sanjoy
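A minimal sketch of the `a[(unsigned long)i]` case mentioned above
(the function name and the stored value are illustrative assumptions):
the explicit cast forces the i32 induction variable to be
zero-extended to i64 for the address computation, reproducing the same
pattern without any frontend involvement.

  // The cast yields `GEP %a, zext(i)` on an LP64 target, so the widened
  // index appears even though the induction variable itself is i32.
  void fill(int *a, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
      a[(unsigned long)i] = 0;
  }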
Steve King via llvm-dev
2015-Aug-18 00:35 UTC
[llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution
On Sun, Aug 16, 2015 at 7:43 PM, Sanjoy Das via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> The problem is not about how to codegen a specific SCEV expression
> efficiently, but about the overall solution LSR converges to (which
> includes considering how it will codegen *other* SCEVs).
>
> [...]
>
> So I don't intend for SCEV to produce SCEVOpaqueExpr nodes itself.
> There would instead be a 'const SCEV *getOpaqueExpr(const SCEV *)'
> function that clients of SCEV can use to create an opaque expression.
> Then most of the simplification routines in SCEV will be changed to
> have the moral equivalent of 'if (isa<SCEVOpaqueExpr>(E)) continue;'.

Hi Sanjoy - I also noticed LSR was converging on suboptimal solutions.
I'm hoping you can clarify this discussion for me.

Given a simple copy loop:

  void copy(int* src, int* dst, unsigned int size) {
    for (unsigned int i = 0; i < size; ++i)
      dst[i] = src[i];
  }

the hot spot for i686 contains 6 instructions with LSR:

  8b 32      mov    (%edx),%esi
  89 31      mov    %esi,(%ecx)
  83 c1 04   add    $0x4,%ecx
  83 c2 04   add    $0x4,%edx
  48         dec    %eax
  75 f3      jne    11 <copy+0x11>

but only 5 instructions without LSR:
  8b 3c b2   mov    (%edx,%esi,4),%edi
  89 3c b1   mov    %edi,(%ecx,%esi,4)
  46         inc    %esi
  39 c6      cmp    %eax,%esi
  75 f5      jne    14 <copy+0x14>

My interpretation of this problem was that LSR starts analysis with an
SCEV that counts down toward zero even in an upward counting loop.  A
more natural fit would be an initial IV going from 0 to some limit,
but there does not seem to be a way to express this sort of SCEV.  The
initial condition puts an artificially high cost on solutions with
upward counting registers since none match the "free" down counting
SCEV.  Notice the weird 'dec %eax' showing up as an artifact.  The
best solution with base-index scale-4 addressing was considered by
LSR, but incorrectly deemed to be too expensive.

Is this the same LSR problem you describe?

Thanks,
-steve
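For illustration, an assumed C++ rendering of the two code shapes in
the listings above (not taken from the thread): the version without
LSR keeps the upward-counting index with scaled addressing, while the
LSR output strength-reduces to two advancing pointers plus a separate
down-counting trip counter (the `dec %eax`).

  // Roughly what the 5-instruction (no-LSR) loop body does.
  void copy_no_lsr(int *src, int *dst, unsigned size) {
    for (unsigned i = 0; i != size; ++i)   // inc %esi / cmp %eax,%esi
      dst[i] = src[i];                     // scaled addressing (%edx,%esi,4)
  }

  // Roughly what the 6-instruction (LSR) loop body does.
  void copy_lsr(int *src, int *dst, unsigned size) {
    for (unsigned n = size; n != 0; --n) { // dec %eax / jne
      *dst++ = *src++;                     // two `add $0x4` pointer bumps
    }
  }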
Sanjoy Das via llvm-dev
2015-Aug-18 05:39 UTC
[llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution
> My interpretation of this problem was that LSR starts analysis with an
> SCEV that counts down toward zero even in an upward counting loop.  A
> more natural fit would be an initial IV going from 0 to some limit,
> but there does not seem to be a way to express this sort of SCEV.  The
> initial condition puts an artificially high cost on solutions with
> upward counting registers since none match the "free" down counting
> SCEV.  Notice the weird 'dec %eax' showing up as an artifact.  The
> best solution with base-index scale-4 addressing was considered by
> LSR, but incorrectly deemed to be too expensive.
>
> Is this the same LSR problem you describe?

No, these two issues look unrelated.

-- Sanjoy