> On Mar 1, 2015, at 7:53 AM, Björn Steinbrink <bsteinbr at gmail.com> wrote:
>
> On 2015.02.28 18:17:27 -0800, Philip Reames wrote:
>>> On Feb 28, 2015, at 3:01 PM, Björn Steinbrink <bsteinbr at gmail.com> wrote:
>>> 2015-02-28 23:50 GMT+01:00 Philip Reames <listmail at philipreames.com>:
>>>>>> On Feb 28, 2015, at 2:30 PM, Björn Steinbrink <bsteinbr at gmail.com> wrote:
>>>>> I should have clarified that that was a reduced, incomplete example; the
>>>>> actual code looks like this (after optimizations):
>>>>>
>>>>> define void @_ZN9test_func20hdd8a534ccbedd903paaE(i1 zeroext) unnamed_addr #0 {
>>>>> entry-block:
>>>>>   %x = alloca [100000 x i32], align 4
>>>>>   %1 = bitcast [100000 x i32]* %x to i8*
>>>>>   %arg = alloca [100000 x i32], align 4
>>>>>   call void @llvm.lifetime.start(i64 400000, i8* %1)
>>>>>   call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 400000, i32 4, i1 false)
>>>>>   %2 = bitcast [100000 x i32]* %arg to i8*
>>>>>   call void @llvm.lifetime.start(i64 400000, i8* %2) ; this happens too late
>>>>>   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %1, i64 400000, i32 4, i1 false)
>>>>>   call void asm "", "r,~{dirflag},~{fpsr},~{flags}"([100000 x i32]* %arg) #2, !noalias !0, !srcloc !3
>>>>>   call void @llvm.lifetime.end(i64 400000, i8* %2) #2, !alias.scope !4, !noalias !0
>>>>>   call void @llvm.lifetime.end(i64 400000, i8* %2)
>>>>>   call void @llvm.lifetime.end(i64 400000, i8* %1)
>>>>>   ret void
>>>>> }
>>>>>
>>>>> If the lifetime start for %arg is moved up, before the memset, the
>>>>> callslot optimization can take place and the %c alloca is eliminated,
>>>>> but with the lifetime starting after the memset, that isn't possible.
>>>> This bit of IR actually seems pretty reasonable given the inline asm. The only thing I really see is that the memcpy could be a memset. Are you expecting something else?
>>>
>>> The only thing to be improved is that the memset should
>>> directly write to %arg, and %x should be removed because it is dead
>>> then. This happens when there are no lifetime intrinsics, or when the
>>> call to lifetime.start is moved before the call to memset. The latter
>>> is what my first mail was about: it is usually better to have
>>> overlapping lifetimes all start at the same point, instead of starting
>>> them as late as possible.
>> Honestly, this sounds like a clear optimizer bug, not something a frontend should work around.
>>
>> Can you file a bug with the four sets of IR? (Both schedules, no intrinsics before and after.) This should hopefully be easy to fix.
>
> I went ahead and made a fix: http://reviews.llvm.org/D7984

FYI, I'm pretty sure this is the wrong approach. I need to confirm by
running with some test cases tomorrow, but if my initial analysis holds
up, the lifetime start may be just a red herring. The underlying issue
appears to be a canonicalization issue with multiple users of an alloca
with conflicting types and the placement of the bitcast used by the
lifetime instruction.

>> Do you know of other cases like this with the lifetime intrinsics?
>
> Not offhand, but I didn't check the IR closely for such issues. The
> memcpy thing was found by chance, too.
>
> Björn
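To make the expected end state concrete, here is an illustrative sketch (editorially added, not IR from the thread) of roughly what the function above becomes once the callslot optimization redirects the memset to write %arg directly: the temporary %x, its bitcast and lifetime markers, and the memcpy are then dead and can be cleaned up by later passes.

define void @_ZN9test_func20hdd8a534ccbedd903paaE(i1 zeroext) unnamed_addr #0 {
entry-block:
  %arg = alloca [100000 x i32], align 4
  %1 = bitcast [100000 x i32]* %arg to i8*
  call void @llvm.lifetime.start(i64 400000, i8* %1)
  ; the memset now initializes %arg in place; no temporary %x, no memcpy
  call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 400000, i32 4, i1 false)
  call void asm "", "r,~{dirflag},~{fpsr},~{flags}"([100000 x i32]* %arg) #2, !noalias !0, !srcloc !3
  call void @llvm.lifetime.end(i64 400000, i8* %1)
  ret void
}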
On 2015.03.01 09:27:02 -0800, Philip Reames wrote:
> > On Mar 1, 2015, at 7:53 AM, Björn Steinbrink <bsteinbr at gmail.com> wrote:
> > [...]
> > I went ahead and made a fix: http://reviews.llvm.org/D7984
> FYI, I'm pretty sure this is the wrong approach. I need to confirm by
> running with some test cases tomorrow, but if my initial analysis holds
> up, the lifetime start may be just a red herring. The underlying issue
> appears to be a canonicalization issue with multiple users of an alloca
> with conflicting types and the placement of the bitcast used by the
> lifetime instruction.

At least for the callslot optimization, the lifetime.start seems to be
key. The call to memset isn't moved, so performing the optimization
without moving the call to lifetime.start would result in the memset
happening before the lifetime.start, which means that it qualifies as a
dead store and may be removed.
Are you implying that there is some other optimization that should
re-order the calls to memset and lifetime.start?

Björn
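To spell out the dead-store hazard Björn describes, here is an illustrative fragment (editorially added, not from the thread) of what redirecting the memset to %arg without also hoisting the lifetime.start would leave behind:

  %arg = alloca [100000 x i32], align 4
  %2 = bitcast [100000 x i32]* %arg to i8*
  ; %arg's lifetime has not started yet, so the memory is considered
  ; dead here and dead-store elimination may legally delete this memset
  call void @llvm.memset.p0i8.i64(i8* %2, i8 0, i64 400000, i32 4, i1 false)
  call void @llvm.lifetime.start(i64 400000, i8* %2)
  call void asm "", "r,~{dirflag},~{fpsr},~{flags}"([100000 x i32]* %arg)

So a pass performing the callslot rewrite here would have to either move the lifetime.start above the memset as part of the same transformation, or decline to do the rewrite.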
On 03/01/2015 09:35 AM, Björn Steinbrink wrote:
> On 2015.03.01 09:27:02 -0800, Philip Reames wrote:
>> [...]
>> FYI, I'm pretty sure this is the wrong approach. I need to confirm by
>> running with some test cases tomorrow, but if my initial analysis holds
>> up, the lifetime start may be just a red herring. The underlying issue
>> appears to be a canonicalization issue with multiple users of an alloca
>> with conflicting types and the placement of the bitcast used by the
>> lifetime instruction.
> At least for the callslot optimization, the lifetime.start seems to be
> key. The call to memset isn't moved, so performing the optimization
> without moving the call to lifetime.start would result in the memset
> happening before the lifetime.start, which means that it qualifies as a
> dead store and may be removed.

I actually ran my tests. I was wrong.
My analysis of the code apparently led me astray. I'll take another,
closer look at your patch tomorrow.

Philip