On 09/28/2014 01:22 AM, Hal Finkel wrote:
> ----- Original Message -----
>> From: "Philip Reames" <listmail at philipreames.com>
>> To: llvmdev at cs.uiuc.edu
>> Cc: "Hal Finkel" <hfinkel at anl.gov>, nicholas at mxc.ca
>> Sent: Wednesday, September 10, 2014 12:11:28 AM
>> Subject: Optimization hints for "constant" loads
>>
>> I'm looking at how to optimize IR which contains reads from a field which is known to be initialized exactly once. I've got an idea on how to approach this, but wanted to see if others have alternate ideas or similar problems which might spark discussion. It feels like there's a potentially generally useful optimization hint here if we can generalize it sufficiently without losing optimization potential.
>>
>> The problem:
>> struct array {
>> private:
>>   // once initialized 'len' never changes
>>   int len;
>>   // data can be modified at will
>>   char data[0];
>> public:
>>   static array* make(int len) {
>>     array* a = ... allocate uninitialized space
>>     a->len = len;
>>     return a;
>>   }
>> };
>> void access(array* a, int idx) {
>>   if( idx >= 0 && idx < a->len ) {
>>     a->data[idx] = 5;
>>   }
>> }
>> void foo(array* a) {
>>   for(int i = 0; i < a->len; i++) {
>>     access(a, i);
>>   }
>> }
>> // assume 'access' is inlined into 'foo' and the loop is unrolled a time or two
>>
>> To phrase that again in English, I've got a field which is initialized once, with naive code which reads from it many times. I know at IR generation time that a load from array::len is special, but I lose this information as soon as I generate IR. In particular, I run into aliasing problems where the location a->len is considered 'mayalias' with unrelated stores, thus preventing value forwarding, LICM, and other desirable optimizations.
>>
>> Existing Approaches:
> I think that, at least in theory, we already have a solution to this problem. If, after the initialization is complete, you insert a call to the llvm.invariant.start intrinsic (and perhaps also do so at the start of any routine you know can be called only after initialization is complete), that should convey the information you want.
>
> I've never worked with this intrinsic before, but if this does not work, I'd be really curious to know why.

From some experiments, the existing invariant intrinsics appear nearly useless for my purposes. The problem is that any call can hide an invariant.end intrinsic. As a result, the optimizer must conservatively assume that any call clobbers the "invariant" location. This makes the intrinsic a non-starter in its current form.

; Function Attrs: uwtable
define zeroext i1 @_Z4testv() #0 {
  %1 = tail call noalias i8* @_Znwm(i64 4) #3
  %2 = bitcast i8* %1 to i32*
  store i32 5, i32* %2, align 4, !tbaa !1
  %discard = call {}* @llvm.invariant.start(i64 4, i8* %1)
  tail call void @_Z6escapePi(i32* %2)
  %3 = load i32* %2, align 4, !tbaa !1
  %4 = icmp eq i32 %3, 5   ; <-- this conditional should be folded away and is not
  ret i1 %4
}

declare {}* @llvm.invariant.start(i64, i8*)

; Function Attrs: nobuiltin
declare noalias i8* @_Znwm(i64) #1

declare void @_Z6escapePi(i32*) #2

It also appears that the intrinsic has limited implementation in the optimizer. Even surprisingly simple cases don't appear to kick in.
Consider:

define zeroext i1 @_Z4testv() #0 {
  %1 = tail call noalias i8* @_Znwm(i64 4) #4
  %2 = bitcast i8* %1 to i32*
  store i32 5, i32* %2, align 4, !tbaa !1
  %discard = tail call {}* @llvm.invariant.start(i64 4, i8* %1)
  %3 = load i32* %2, align 4, !tbaa !1
  %4 = icmp eq i32 %3, 5   ; <-- This conditional should be folded
  tail call void @_Z6escapePi(i32* %2, i1 %4)
  %5 = load i32* %2, align 4, !tbaa !1
  %6 = icmp eq i32 %5, 5
  ret i1 %6
}

We could extend the existing intrinsic with a notion of invariant.start that has no end. This could be as simple as adding a boolean parameter to the intrinsic. I think this could be made to work. There'd be other implementation work needed to make it actually useful; the dominance-based validity model used by assumes seems like an obvious candidate.

Thinking through this, I see a couple of places that we'd change:
- isLocationKnownInvariant(Value* Loc, Instruction* Cntx) and a related analysis pass (analogous to assumptions)
  - The new "no end" version is easy (dominance)
  - The older version is a graph walk looking for paths which include either an end or a call
- Eagerly propagate known-invariant location store values in EarlyCSE (strictly by dominance for the new intrinsic form)
- Add an InstCombine rule to fold a store to a known-invariant location to undef
- Teach GVN about it (gets the non-dominance cases); could also do fun things with phis of invariant locations, but not sure that's actually useful
- Teach LICM to lift loads of known-invariant locations out of loops

Does this seem like a workable approach?

Philip

>> 1) Use TBAA - By tagging loads and stores to the two fields of the array struct as disjoint branches in the TBAA tree, I can inform LLVM that a load of 'len' never aliases with a store through 'data'. This mostly works, and enables many loads to be forwarded by GVN, but (due to the intervening stores) is a complete loss in EarlyCSE and (due to intervening calls) LICM.
>>   a) Things like http://llvm.org/bugs/show_bug.cgi?id=20805 could improve the situation in EarlyCSE.
>> 2) Use "invariant.load" metadata - This metadata indicates that the field loaded from is initialized before the execution of the code being compiled. In particular, "invariant.load" implies that the load is not control dependent on any branch, is safe to speculate, and that no write aliases the location read from. This mostly works, but only if 'array::make' is never visible in the IR. As soon as 'array::make' gets inlined, all bets are off and mis-compiles may result.
>>   a) Also, in practice, only LICM really knows about "invariant.load". It would be pretty straightforward to teach both EarlyCSE and GVN about them though. http://llvm.org/bugs/show_bug.cgi?id=20806
>>
>> New Approaches:
>> (This isn't so much "being proposed" as "being put forward for discussion".)
>>
>> 1) Introduce a new metadata type "initialized-before-load" which implies that the value of any two loads tagged with the metadata along any execution path must yield the same result.
>>
>> This doesn't give much freedom to the 'first' load; it's still control dependent, can't be reordered with preceding stores, can't be lifted out of loops, etc... It can however be reordered with following stores or sunk out of a loop provided the loop body is known to execute at least once.
>>
>> The second load has a lot more freedom. Provided that there is always another load to the same location (with the metadata) provably preceding it on all paths, it can be reordered freely, lifted over control branches, lifted out of loops, etc... Importantly, it is also legal to forward the value of a preceding load to a later load provided that a) both have the metadata and b) one load executes strictly before the other.
>>
>> A store marked "initialized-before-load" is undefined if there is a load with the same metadata on the same location preceding it. There may be *multiple* stores to the location along a path, provided that the first load is strictly after *all* of them.
>>
>> This seems straightforward to implement in EarlyCSE and LICM. I haven't looked closely at GVN, but expect it's probably not hard either.
>>
>> 2) Introduce a slightly different metadata "initialized-once". Semantics are very similar to the preceding except that there can only be a single store to the location along any path.
>>
>> Value forwarding from the store to a following load (with metadata) is allowed regardless of potentially aliasing intervening stores.
>>
>> This was actually my original idea, but it has a couple of problems. First, it breaks on surprisingly common initialization patterns such as default initialization followed by real initialization. Secondly, I'm not sure the optimizer would always preserve the "write once" property. In particular, the optimizer is free to divide a large write into several smaller ones (assuming the write is not atomic).
>>
>> Thoughts? Suggestions? Similar-sounding problems this might be able to solve?
>>
>> Philip
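For concreteness, a minimal sketch of what the proposed "no end" form might look like, applied to the first example above. The extra i1 parameter is hypothetical; the thread does not settle on an exact signature or on how the flag would be encoded:

define zeroext i1 @_Z4testv() {
  %1 = tail call noalias i8* @_Znwm(i64 4)
  %2 = bitcast i8* %1 to i32*
  store i32 5, i32* %2, align 4
  ; hypothetical third parameter: i1 true = the invariance never ends
  %discard = call {}* @llvm.invariant.start(i64 4, i8* %1, i1 true)
  tail call void @_Z6escapePi(i32* %2)
  %3 = load i32* %2, align 4
  %4 = icmp eq i32 %3, 5   ; with "no end" semantics this could fold to true
  ret i1 %4
}

declare {}* @llvm.invariant.start(i64, i8*, i1)   ; hypothetical variant; the current intrinsic takes only (i64, i8*)
declare noalias i8* @_Znwm(i64)
declare void @_Z6escapePi(i32*)

Under this reading, no later call can hide an llvm.invariant.end, so EarlyCSE or GVN could forward the stored 5 to %3 across the call to @_Z6escapePi and fold the compare.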
----- Original Message -----
> From: "Philip Reames" <listmail at philipreames.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: nicholas at mxc.ca, llvmdev at cs.uiuc.edu
> Sent: Friday, October 10, 2014 8:07:55 PM
> Subject: Re: Optimization hints for "constant" loads
>
> [...]
>
> Does this seem like a workable approach?

Yes, I think this sounds quite reasonable.

 -Hal

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
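To make the LICM item on Philip's list concrete, a hedged before/after sketch, assuming the same hypothetical "no end" parameter as in the sketch above:

define void @foo(i32* %len.addr) {
entry:
  %p = bitcast i32* %len.addr to i8*
  %inv = call {}* @llvm.invariant.start(i64 4, i8* %p, i1 true)   ; hypothetical "no end" form
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %len = load i32* %len.addr, align 4   ; today this load stays in the loop...
  call void @use(i32 %i)                ; ...because this call may write any memory
  %i.next = add nsw i32 %i, 1
  %cmp = icmp slt i32 %i.next, %len
  br i1 %cmp, label %loop, label %exit

exit:
  ret void
}

declare {}* @llvm.invariant.start(i64, i8*, i1)   ; hypothetical variant
declare void @use(i32)

With a dominating "no end" invariant.start, LICM could hoist the load of %len.addr into the entry block (the load already executes on the first iteration, so no speculation is needed), leaving the loop body free of it.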
On 10/11/2014 06:01 PM, Hal Finkel wrote:
>> [...]
>>
>> Does this seem like a workable approach?
> Yes, I think this sounds quite reasonable.

I'll start throwing together some patches. It'll take a couple of weeks due to current distractions, but I'll get something up for review when I can.

Philip
> On Oct 10, 2014, at 6:07 PM, Philip Reames <listmail at philipreames.com> wrote:
>
> [...]
> Does this seem like a workable approach?

This is also something that I've wanted to get support for, so thanks for working on it.

I've never really understood how the llvm.invariant intrinsics could be put into practice. There is the problem that "end" can occur anywhere, as you suggested fixing with a flag. But there is also the problem that "start" needs to be visible in-scope. Additionally, llvm.invariant is inherently control dependent, which doesn't really allow the optimizer to hoist the invariant loads the way we want it to.

Are you planning to have your frontend emit llvm.invariant.start intrinsics *everywhere* the field is accessed? If the frontend creates a bunch of small llvm.invariant scopes, with arbitrary calls in between, we won't be able to hoist or combine them. Although with your proposal we could at least eliminate the llvm.invariant calls that are dominated by another identical one.

I think it would be simpler and more powerful for the frontend to treat loads from a particular type as invariant and have a safe way to express the initialization. In other words, the frontend could always emit the invariant metadata on the loads, but guard the store to avoid loading before initialization. The question is how to guard the initial store(s). Because we treat invariant metadata as effectively meaning "can't alias", I think we need to express this as a data dependence on the pointer that we load from. Conceptually:

  %p = unaliased address
  store val, %p
  %a = llvm.safe_cast %p ; just made this up
  val = load %a !invariant

This approach would give optimization maximum freedom in code that doesn't involve initialization. It would block store-load forwarding from the initializer itself to the invariant loads, but I think handling that is a simple peephole.

There are multiple important use-cases for write-once data:
- Constant globals with nontrivial initialization.
- Runtime metadata, including but not limited to object headers.
- Builtin data structures with immutable semantics.
- Maybe even language features for user-defined data structures.

In some cases, we want to lazily initialize a value. Some cases even have conditionally invariant values. Ideally, we can express all of these situations in LLVM IR. I think of the generalization of the problem as:
- some arbitrary condition produces a token
- all loads from the immutable memory must acquire this token

It might look like this:

  %p1 = ...
  %p2 = llvm.safe_cast %p1
  %a = select i1 %is_immutable, %p1, %p2
  load %a !invariant

If %is_immutable can be proven to be true, then the safe_cast becomes a nop. Otherwise, the safe_cast appears to read memory reachable from its pointer, preventing stores from being reordered across it. The safe_cast is readonly, so loads can be hoisted and optimized across it and redundant stores can be eliminated.

Examples:

(1) Immutable fields in builtin data structures (Array.length)

  ; initialization
  %p = <allocate>
  store %v0, %p
  %a = llvm.safe_cast %p
  store %a, %container
  ...
  ; access to the immutable data in a different scope
  %a = load %container
  %v1 = load %a, !invariant

Here the second load is invariant but depends on the first load of the container object (array), %a. The first load effectively acquires the token. This load is not generally invariant, but likely loop invariant and may benefit from TBAA. We would get the same effect by returning the pointer to the "new" array and wouldn't need the first load in that case.

(2) Lazily initialized data

  %p1 = ...
  if (!%is_initialized) ; some arbitrary condition
    store %initval, %p1
  %p2 = llvm.safe_cast %p1
  %a = select i1 %is_initialized, %p1, %p2
  %v = load %a, !invariant

Here we have an invariant load guarded by a condition, %is_initialized. If the optimizer can reason about the condition, which may be tested repeatedly in the same scope, it can optimize away the lazy initialization path. These patterns can potentially be optimized via correlated value propagation, jump-threading, and unswitching.

(3) Conditionally immutable data

I realize this case is confusing but I'm laying out an example here anyway as a proof of concept.

  ; Load the internal state of an object. Once it is immutable, it is
  ; always immutable. This computation is marked invariant so it can be
  ; hoisted and combined.
  %invariant_internal_state = load %object, !invariant
  %invariant_is_immutable_flag = load %invariant_internal_state, !invariant
  %invariant_is_immutable_check = cmp %invariant_is_immutable_flag, ...
  ; Any subsequent load of the internal state after it may have changed
  ; must be guarded.
  %safe_object = llvm.safe_cast %object
  ; conditionally reload the object
  %reload_internal_state = load %safe_object
  %internal_state = select i1 %invariant_is_immutable_check, %invariant_internal_state, %reload_internal_state

The %invariant_is_immutable_check can now be combined and hoisted out of loops. This case is tricky because every load of the object's internal state must be guarded. It's up to the front end to handle this. This technique may be useful if you need to aggressively optimize for the case in which an object is dynamically determined to be immutable, and that is expected to be the common case.

—

I think the safe_cast intrinsic would work for your array.length case and also for initialization of metadata.
It wouldn't be as great for standard global variables because we would either need to introduce a level of indirection to get at the global, introduce a flag to check initialization, or add a safe_cast to every read of the variable.

For this technique to be really effective, we also need to give the invariant loads may-speculate semantics. Do you have a plan for that? I'm fine with introducing another "may-speculate" metadata type for it initially. We could then use the safe_cast intrinsic to guard these loads on type checks in addition to initialization.

I would also give the safe_cast a constant ID and semantics so that only casts with the same ID could be combined. For example, two checks on the same type can be combined if one dominates the other, regardless of the intervening code. However, we could have multiple independent checks on the same pointer value, each with different control dependencies.

-Andy
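A possible self-contained rendering of Andy's example (1). Here @safe_cast is only a stand-in for the made-up llvm.safe_cast (no such intrinsic exists), and !invariant.load is the existing metadata the thread refers to as "invariant.load":

declare i8* @safe_cast(i8*) readonly        ; stand-in for the proposed llvm.safe_cast
declare noalias i8* @_Znwm(i64)

define i8* @make_array(i32 %len, i8** %container) {
  %raw = tail call noalias i8* @_Znwm(i64 16)
  %len.addr = bitcast i8* %raw to i32*
  store i32 %len, i32* %len.addr, align 4   ; the one-time initialization
  %blessed = call i8* @safe_cast(i8* %raw)  ; only the "blessed" pointer escapes
  store i8* %blessed, i8** %container, align 8
  ret i8* %blessed
}

define i32 @read_len(i8** %container) {
  %a = load i8** %container, align 8        ; acquires the token; not itself invariant
  %len.addr = bitcast i8* %a to i32*
  %len = load i32* %len.addr, align 4, !invariant.load !0
  ret i32 %len
}

!0 = metadata !{}

As Andy notes, the initializing store goes through %raw while later loads go through the blessed pointer, so forwarding from the initializer itself would need the peephole he mentions.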
> I've never really understood how the llvm.invariant intrinsics could be
> put into practice. There is the problem that "end" can occur anywhere
> as you suggested fixing with a flag.

I was under this impression too, but after re-reading the language reference I'm not so sure -- it says about invariant.start: "This intrinsic indicates that until an llvm.invariant.end that uses the return value, the referenced memory location is constant and unchanging." I think this implies that if the result of an `llvm.invariant.start` doesn't escape, we can safely conclude that there can be no matching `llvm.invariant.end`.

I'm still a little hazy on all of the use cases you want to support, but one problem with llvm.safe_cast is, as you note, that stores to the pointer passed to safe_cast will not be forwarded to loads from the pointer it returns. I think this issue can be solved by modeling `llvm.safe_cast` not as a transformation on a pointer, but as a check. Specifically, I imagine a readonly intrinsic `llvm.assert_immutable` that "asserts" that the pointer passed to it is immutable, and unwinds if that fact isn't true (these are only imaginary semantics -- we know the intrinsic will never actually unwind). This means `llvm.assert_immutable` can't be marked nounwind (otherwise it would just get optimized away); but since it doesn't need to unwind to the same frame, a CallInst is sufficient (an InvokeInst would've increased the CFG's complexity). So

  %p = unaliased address
  store 42, %p
  %a = llvm.safe_cast %p ; just made this up
  %val = load %a !invariant

becomes

  %p = unaliased address
  store 42, %p
  call void @llvm.assert_immutable(%p)
  %val = load %p !invariant

AFAICT, "load %p" cannot, in general, be re-ordered to before the call to @llvm.assert_immutable because we don't know what condition it uses to decide if it should unwind. There are cases where I think such a re-ordering would be legal (an unused %val is one example; another is where the only use of %val is a comparison with undef) but, if I understand correctly, re-ordering a load with the call to @llvm.assert_immutable can only be a performance loss -- it will change dominance in a way that we'll "forget" that a load was immutable. And I think in most of the cases that are interesting to optimize, the re-ordering will not be legal.

It is still legal to value forward the store though:

  %p = unaliased address
  store 42, %p
  call void @llvm.assert_immutable(%p)
  %val = 42

because *if* assert_immutable does not unwind, %val will see the dominating store, because assert_immutable is readonly.

Again,

  %p1 = ...
  %p2 = llvm.safe_cast %p1
  %a = select i1 %is_immutable, %p1, %p2
  load %a !invariant

could become

  %p1 = ...
  if (not %is_immutable) {
    call llvm.assert_immutable(%p1)
  }
  load %p1

or, if we wish to reduce control flow, we could define `llvm.assert_immutable` to be a no-op on null:

  %p1 = ...
  %a = select i1 %is_immutable, null, %p1
  call llvm.assert_immutable(%a)
  load %p1 !invariant

`llvm.assert_immutable` does with control dependencies what `llvm.safe_cast` does with data dependencies.

> llvm.invariant is inherently control dependent, which doesn't really
> allow the optimizer to hoist the invariant loads the way we want it
> to.

I'm curious about why you think so -- do you have examples that demonstrate this? And is this a flaw in how llvm implements llvm.invariant or something fundamental about control dependencies?

-- Sanjoy
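A sketch of Sanjoy's variant as a single module, with @assert_immutable standing in for the imagined llvm.assert_immutable (again, not a real intrinsic; readonly but deliberately not nounwind):

declare void @assert_immutable(i8*) readonly   ; stand-in for the imagined llvm.assert_immutable

define i32 @example(i32* %p) {
  %p.i8 = bitcast i32* %p to i8*
  store i32 42, i32* %p, align 4
  call void @assert_immutable(i8* %p.i8)
  ; The load below cannot be hoisted above the call (as far as the optimizer
  ; knows, the call might unwind instead of returning), but because the call
  ; is readonly the store of 42 can still be forwarded, so %val folds to 42.
  %val = load i32* %p, align 4, !invariant.load !0
  ret i32 %val
}

!0 = metadata !{}

This illustrates the two properties Sanjoy describes: the check orders the invariant load after initialization via control dependence, while store-to-load forwarding across the readonly call remains legal.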