On 02/28/2015 10:04 AM, Björn Steinbrink wrote:
> Hi,
>
> On 2015.02.28 10:53:35 -0600, Hal Finkel wrote:
>> ----- Original Message -----
>>> From: "Philip Reames" <listmail at philipreames.com>
>>>> 6. Use the lifetime.start/lifetime.end and
>>>> invariant.start/invariant.end intrinsics where possible
>>> Do you find these help in practice? The few experiments I ran were
>>> neutral at best and harmful in one or two cases. Do you have
>>> suggestions on how and when to use them?
>> Good point, we should be more specific here. My, admittedly limited,
>> experience with these is that they're most useful when their
>> properties are not dynamic -- which perhaps means that they
>> post-dominate the entry, and are applied to allocas in the entry block
>> -- and the larger the objects in question, the more the potential
>> stack-space savings, etc.
> My experience adding support for the lifetime intrinsics to the Rust
> compiler is largely positive (because our code is very stack heavy at
> the moment), but we still suffer from missed memcpy optimizations.
> That happens because I made the lifetime regions as small as possible,
> and sometimes an alloca starts its lifetime too late for the optimization
> to happen. My new (but not yet implemented) approach is to "align" the
> calls to lifetime.start for allocas with overlapping lifetimes unless
> there's actually a possibility for stack slot sharing.
>
> For example, we currently translate:
>
>     let a = [0; 1000000]; // Array of 1000000 zeros
>     {
>         let b = a;
>     }
>     let c = something;
>
> to roughly this:
>
>     lifetime.start(a)
>     memset(a, 0, 1000000)
>
>     lifetime.start(b)
>     memcpy(b, a)
>     lifetime.end(b)
>
>     lifetime.start(c)
>     lifetime.end(c)
>
>     lifetime.end(a)
>
> The lifetime.start call for "b" stops the call-slot (I think)
> optimization from being applied. So instead this should be translated to
> something like:
>
>     lifetime.start(a)
>     lifetime.start(b)
>     memset(a, 0, 1000000)
>
>     memcpy(b, a)
>     lifetime.end(b)
>
>     lifetime.start(c)
>     lifetime.end(c)
>
>     lifetime.end(a)
>
> extending the lifetime of "b" because it overlaps with that of "a"
> anyway. The lifetime of "c" still starts after the end of "b"'s lifetime
> because there's actually a possibility for stack slot sharing.
>
> Björn

I'd be interested in seeing the IR for this that you're currently
generating. Unless I'm misreading your example, everything in this is
completely dead. We should be able to reduce this to nothing, and if we
can't, it's clearly a missed optimization. I'm particularly interested in
how the difference in placement of the lifetime start for 'b' affects
optimization. I really wouldn't expect that.

Philip
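For reference, the rewrite Björn is alluding to (he tentatively calls it the
call-slot optimization; the related transforms live in LLVM's MemCpyOpt pass)
takes a temporary that is filled and then immediately copied into its real
destination, and redirects the fill to write the destination directly, leaving
the temporary dead. Below is a minimal hand-written sketch of that idea in the
IR syntax of this era; the function names @before/@after and the external
consumer @use are hypothetical, and this is not the IR rustc actually emits.

  declare void @use(i8*)

  ; "before": %tmp is zero-filled and then copied into %dst, which @use reads.
  define void @before() {
  entry:
    %dst = alloca [1000000 x i8], align 1
    %tmp = alloca [1000000 x i8], align 1
    %dst.i8 = bitcast [1000000 x i8]* %dst to i8*
    %tmp.i8 = bitcast [1000000 x i8]* %tmp to i8*
    call void @llvm.lifetime.start(i64 1000000, i8* %tmp.i8)
    call void @llvm.memset.p0i8.i64(i8* %tmp.i8, i8 0, i64 1000000, i32 1, i1 false)
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst.i8, i8* %tmp.i8, i64 1000000, i32 1, i1 false)
    call void @llvm.lifetime.end(i64 1000000, i8* %tmp.i8)
    call void @use(i8* %dst.i8)
    ret void
  }

  ; "after" the rewrite: the zero-fill targets %dst directly, so the memcpy,
  ; the %tmp alloca, and its lifetime markers are all dead and can be removed.
  define void @after() {
  entry:
    %dst = alloca [1000000 x i8], align 1
    %dst.i8 = bitcast [1000000 x i8]* %dst to i8*
    call void @llvm.memset.p0i8.i64(i8* %dst.i8, i8 0, i64 1000000, i32 1, i1 false)
    call void @use(i8* %dst.i8)
    ret void
  }

  declare void @llvm.lifetime.start(i64, i8* nocapture)
  declare void @llvm.lifetime.end(i64, i8* nocapture)
  declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i32, i1)
  declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture, i8* nocapture readonly, i64, i32, i1)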
On 2015.02.28 14:23:02 -0800, Philip Reames wrote:
> On 02/28/2015 10:04 AM, Björn Steinbrink wrote:
>> [...]
>
> I'd be interested in seeing the IR for this that you're currently
> generating. Unless I'm misreading your example, everything in this is
> completely dead. We should be able to reduce this to nothing, and if we
> can't, it's clearly a missed optimization. I'm particularly interested in
> how the difference in placement of the lifetime start for 'b' affects
> optimization. I really wouldn't expect that.

I should have clarified that that was a reduced, incomplete example; the
actual code looks like this (after optimizations):

define void @_ZN9test_func20hdd8a534ccbedd903paaE(i1 zeroext) unnamed_addr #0 {
entry-block:
  %x = alloca [100000 x i32], align 4
  %1 = bitcast [100000 x i32]* %x to i8*
  %arg = alloca [100000 x i32], align 4
  call void @llvm.lifetime.start(i64 400000, i8* %1)
  call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 400000, i32 4, i1 false)
  %2 = bitcast [100000 x i32]* %arg to i8*
  call void @llvm.lifetime.start(i64 400000, i8* %2) ; this happens too late
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %1, i64 400000, i32 4, i1 false)
  call void asm "", "r,~{dirflag},~{fpsr},~{flags}"([100000 x i32]* %arg) #2, !noalias !0, !srcloc !3
  call void @llvm.lifetime.end(i64 400000, i8* %2) #2, !alias.scope !4, !noalias !0
  call void @llvm.lifetime.end(i64 400000, i8* %2)
  call void @llvm.lifetime.end(i64 400000, i8* %1)
  ret void
}

If the lifetime start for %arg is moved up, before the memset, the
call-slot optimization can take place and the %x alloca is eliminated,
but with the lifetime starting after the memset, that isn't possible.

Björn
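If the frontend emitted lifetime.start for %arg before the memset, the end
state Björn is asking for would look roughly like the sketch below. This is
hand-written, not actual optimizer output; the attribute groups, metadata, and
intrinsic declarations from the dump above are omitted for brevity. The point
is that the zero-fill is redirected to %arg, after which %x, its lifetime
markers, and the memcpy are all dead.

define void @_ZN9test_func20hdd8a534ccbedd903paaE(i1 zeroext) unnamed_addr {
entry-block:
  %arg = alloca [100000 x i32], align 4
  %1 = bitcast [100000 x i32]* %arg to i8*
  call void @llvm.lifetime.start(i64 400000, i8* %1)
  ; the zero-fill now writes %arg directly; %x and the memcpy are gone
  call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 400000, i32 4, i1 false)
  call void asm "", "r,~{dirflag},~{fpsr},~{flags}"([100000 x i32]* %arg)
  call void @llvm.lifetime.end(i64 400000, i8* %1)
  ret void
}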
> On Feb 28, 2015, at 2:30 PM, Björn Steinbrink <bsteinbr at gmail.com> wrote:
>
>> On 2015.02.28 14:23:02 -0800, Philip Reames wrote:
>>> [...]
>>
>> I'd be interested in seeing the IR for this that you're currently
>> generating. Unless I'm misreading your example, everything in this is
>> completely dead. We should be able to reduce this to nothing, and if we
>> can't, it's clearly a missed optimization. I'm particularly interested in
>> how the difference in placement of the lifetime start for 'b' affects
>> optimization. I really wouldn't expect that.
>
> I should have clarified that that was a reduced, incomplete example; the
> actual code looks like this (after optimizations):
>
> [...]
>
> If the lifetime start for %arg is moved up, before the memset, the
> call-slot optimization can take place and the %x alloca is eliminated,
> but with the lifetime starting after the memset, that isn't possible.

This bit of IR actually seems pretty reasonable given the inline asm. The
only thing I really see is that the memcpy could be a memset. Are you
expecting something else?

Philip
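To make Philip's "the memcpy could be a memset" remark concrete, the rewrite he
presumably has in mind is the following change inside Björn's dump above, shown
here as a hand-edited excerpt rather than verified pass output: because %x
holds nothing but the zeros stored by the first memset, the copy into %arg is
equivalent to zero-filling %arg directly, after which the first memset has no
readers left and it, and eventually %x itself, can be dropped as dead.

  ; before (excerpt of the dump above):
  call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 400000, i32 4, i1 false)
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %1, i64 400000, i32 4, i1 false)

  ; after replacing the copy of a known-zero buffer with a direct zero-fill:
  call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 400000, i32 4, i1 false) ; now unread
  call void @llvm.memset.p0i8.i64(i8* %2, i8 0, i64 400000, i32 4, i1 false)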