Peter Lawrence via llvm-dev
2017-May-17 04:09 UTC
[llvm-dev] Which pass should be propagating memory copies
Keno,
    Perhaps you can view the problem as the memcpys themselves. We humans can look at the memcpys and see loads and stores, but to almost all optimizer passes they aren't what those passes are looking for; the passes instead see function calls, which they mostly don't touch. If these memcpys were expanded into plain old loads and stores, the redundant loads and stores should be deleted by the existing optimizations.

    A question I have for you: because this looks like "copy-in-copy-out" argument semantics, which to me looks more like Ada than C, what was the source language?

Peter Lawrence.

> On May 16, 2017, at 12:37 PM, Keno Fischer via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Consider the following IR example:
>
>   define void @simple([4 x double]* %ptr, i64 %idx) {
>     %stack = alloca [4 x double]
>     %ptri8 = bitcast [4 x double]* %ptr to i8*
>     %stacki8 = bitcast [4 x double]* %stack to i8*
>     call void @llvm.memcpy.p0i8.p0i8.i32(i8* %stacki8, i8* %ptri8, i32 32, i32 0, i1 0)
>     %dataptr = getelementptr inbounds [4 x double], [4 x double]* %ptr, i32 0, i64 %idx
>     store double 0.0, double* %dataptr
>     call void @llvm.memcpy.p0i8.p0i8.i32(i8* %ptri8, i8* %stacki8, i32 32, i32 0, i1 0)
>     ret void
>   }
>
> I would like to see this optimized to just a single store (into %ptr). Right now, even at -O3 that doesn't happen. My frontend guarantees that %idx is always in bounds for the allocation, but I do think the transformation should be valid regardless, because accessing beyond the bounds of the alloca would be undefined behavior. Now, my question is: which pass should be responsible for doing this? SROA? DSE? GVN? A new pass just to do this kind of thing? Maybe there already is some pass that does this, just not in the default pipeline? Any hints would be much appreciated.
>
> Thanks,
> Keno
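A C-level analogue may make the pattern easier to see. This is a sketch of my own, not code from the Julia frontend, and (per a correction later in the thread) the intermediate store is meant to target the stack copy:

```c
#include <string.h>

/* Copy-in/copy-out around a single element store: copy the aggregate into a
   local, store into the local copy, then copy the whole aggregate back.
   The desired optimization folds all of this to one store into buf[idx]. */
void simple(double buf[4], long idx) {
    double stack[4];
    memcpy(stack, buf, sizeof stack);   /* copy-in  */
    stack[idx] = 0.0;                   /* the one real store */
    memcpy(buf, stack, sizeof stack);   /* copy-out */
}
```

The net effect on `buf` is identical to a single `buf[idx] = 0.0`, which is exactly the reduction being asked for at the IR level.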
Keno Fischer via llvm-dev
2017-May-17 15:22 UTC
[llvm-dev] Which pass should be propagating memory copies
On Wed, May 17, 2017 at 12:09 AM, Peter Lawrence via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Perhaps you can view the problem as the memcpys themselves. [...]
>
> A question I have for you: because this looks like "copy-in-copy-out"
> argument semantics, which to me looks more like Ada than C, what was the
> source language?

No, I very much want the memcpys there. With individual stores I'd give up hope that the optimizer can figure out what's going on here, especially if it gets beyond a few bytes, but with memcpys it does seem doable. As for which frontend produced this: we're considering adding language semantics to Julia that would produce lots of code like this, so we're looking into getting the optimizer to fold the extra copies away.

Keno
Keno Fischer via llvm-dev
2017-May-17 15:55 UTC
[llvm-dev] Which pass should be propagating memory copies
Well, mostly I want to hoist the store to the stack and transform it into a store to the heap. After that the memcpys are essentially trivially dead, so instcombine or DSE will delete them for me. If the memcpys were made of individual stores instead, there'd have to be some sort of exponential search somewhere in the compiler to figure that out. For the extreme case, consider [100000000 x double]: the same optimization can apply, but if it had to look through 100000000 individual stores, I wouldn't expect the compiler to figure that out. What I meant was that I think the memcpys are the correct representation of this from the frontend; it's just that I'd like more optimization to happen here.

On Wed, May 17, 2017 at 11:48 AM, Peter Lawrence <peterl95124 at sbcglobal.net> wrote:
> Keno,
>         "No, I very much want the memcpys there" seems like a contradiction.
> Aren't you trying to optimize away the memcpys?
>
> Peter Lawrence
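To make the goal concrete, here is (in C terms, my own sketch rather than anything posted in the thread) what the function should collapse to once the element store is hoisted onto the original object:

```c
/* After hoisting the store onto the heap object, the copy-in writes bytes
   that are never read again and the copy-out copies back unchanged bytes,
   so both memcpys become trivially dead and only the single store remains. */
void simple_opt(double buf[4], long idx) {
    buf[idx] = 0.0;
}
```

The key point is that reasoning about two memcpys plus one store is cheap regardless of the aggregate's size, whereas reasoning about millions of expanded scalar stores is not.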
Keno Fischer via llvm-dev
2017-May-17 15:58 UTC
[llvm-dev] Which pass should be propagating memory copies
Argh, I'm sorry. I realize the original IR has a most unfortunate typo. What I meant was that the intermediate store should have happened on the stack, i.e.

  %dataptr = getelementptr inbounds [4 x double], [4 x double]* %stack, i32 0, i64 %idx

I'm sorry for any confusion that caused.

On Wed, May 17, 2017 at 11:55 AM, Keno Fischer <keno at juliacomputing.com> wrote:
> Well, mostly I want to hoist the store to the stack and transform it into
> a store to the heap. After that the memcpys are essentially trivially dead,
> so instcombine or dse will delete them for me. [...]
Peter Lawrence via llvm-dev
2017-May-17 17:28 UTC
[llvm-dev] Which pass should be propagating memory copies
Keno,
    Hmmm, it seems you are saying that "copy-in-copy-out" argument semantics are required by the language (otherwise you would be using "by-reference" argument semantics), and that CICO is easiest for you to represent with memcpy.

    Usually there are some very subtle issues with CICO and the memory model. Typically the original object isn't supposed to be modified until the function returns; that is, multiple stores, especially of different values, to a field of the original object should not be visible, only the final store. This is clearly observable in multithreaded programs, but it can also be observable in a single-threaded program if the same object is visible from within the called function, for example as a global variable, which would then be seen to have its internal values change multiple times, even though the intent of a language using CICO is to ensure all-at-once state changes (at least for single-threaded programs).

    My advice, if the above applies to you, is to add a new pass to the compiler that figures out whether the transformation from memcpy to explicit multiple loads/stores is actually legal (won't produce intermediate state changes before the exit of the function, which would violate the strict CICO calling convention) and also profitable (I don't view the code explosion of [100000000 x double] as profitable!), or whether the transformation from CICO to pure "by-reference" is both legal and profitable. (Also, don't forget to check what the language spec says about this function passing the object, or parts of it, to other functions before or after making modifications.)

    My advice regarding teaching GVN about memcpy is: don't. It would be one thing if the memcpy were copying a single variable in or out; in that case the memcpy can and should be viewed as a load/store pair. But in your case it isn't being used that way: it is being used to copy multiple values, and the only logical thing GVN could do is expand them into multiple individual loads and stores. GVN should not be doing this. Instead, your new pass (which first checks that the expansion is legal w.r.t. the calling convention) is the place to do it, or to convert to pure "by-reference" where that is legal, which also shouldn't be done in GVN.

—Peter Lawrence.

> On May 17, 2017, at 8:55 AM, Keno Fischer <keno at juliacomputing.com> wrote:
>
> Well, mostly I want to hoist the store to the stack and transform it into
> a store to the heap. After that the memcpys are essentially trivially dead,
> so instcombine or dse will delete them for me. [...]
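Peter's observability point can be demonstrated with a small single-threaded C model (my own toy example, using a global as the "also visible" object; an illustration of the semantic issue, not Julia or LLVM code):

```c
#include <string.h>

double g[4] = {1.0, 2.0, 3.0, 4.0};  /* object also reachable inside the callee */

/* Stand-in for any other code that can observe g mid-function. */
static double peek(void) { return g[0]; }

/* Under copy-in/copy-out, all stores land in the private copy, so peek()
   still sees the original value; only the copy-out at function exit
   publishes the change.  Expanding the copies into eager stores directly
   into g would make the intermediate value visible, changing behavior. */
double f(void) {
    double local[4];
    memcpy(local, g, sizeof local);  /* copy-in */
    local[0] = 99.0;                 /* not yet visible through g */
    double seen = peek();            /* observes the original g[0] */
    memcpy(g, local, sizeof g);      /* copy-out at function exit */
    return seen;
}
```

Here `f()` returns 1.0 even though `g[0]` ends up as 99.0; a pass that rewrote the copies into direct stores to `g` would make `peek()` return 99.0 instead, which is exactly the legality check Peter says a new pass would have to perform.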