Peter Lawrence via llvm-dev
2017-May-17 17:28 UTC
[llvm-dev] Which pass should be propagating memory copies
Keno,
      Hmmm, it seems you are saying that "copy-in-copy-out" argument semantics are required by the language (otherwise you would be using "by-reference" argument semantics), and that CICO is easiest for you to represent with memcpy.

Usually there are some very subtle issues with CICO and the memory model. Typically the original object isn't supposed to be modified until the function returns, i.e. multiple stores, especially of different values, to a field of the original object should not be visible; only the final store should be. This is clearly observable in multithreaded programs, but it can also be observable in a single-threaded program if the same object is visible from within the called function, for example as a global variable, which would then be seen to have its internal values change multiple times, even though the intent of a language using CICO is to ensure all-at-once state changes (at least for single-threaded programs).

My advice, if the above applies to you, is to add a new pass to the compiler that figures out whether the transformation from memcpy to explicit multiple loads/stores is actually legal (won't produce intermediate state changes before the exit of the function, which would violate a strict CICO calling convention) and also profitable (I don't view the code explosion of [1000000 x double] as profitable!), or whether the transformation from CICO to pure by-reference is both legal and profitable. (Also, don't forget to check what the language spec says about this function passing the object, or parts of it, to other functions before or after making modifications.)

My advice regarding teaching GVN about memcpy is not to. It would be one thing if the memcpy were copying in/out a single variable; in that case the memcpy can and should be viewed as a load/store pair. But in your case it isn't being used that way: it is being used to copy multiple values, and the only logical thing GVN could do is expand those out to multiple individual loads and stores. GVN should not be doing this. Instead, your new pass (which first checks whether it is legal w.r.t. the calling convention) is the place to do this, or to convert to pure by-reference if that is legal, which also shouldn't be done in GVN.

—Peter Lawrence

> On May 17, 2017, at 8:55 AM, Keno Fischer <keno at juliacomputing.com> wrote:
>
> Well, mostly I want to hoist the store to the stack and transform it into a store to the heap. After that the memcpys are essentially trivially dead, so instcombine or dse will delete them for me. If the memcpys were made of individual stores instead, there'd have to be some sort of exponential search somewhere in the compiler to figure that out. For the extreme case consider [100000000 x double]. The same optimization can apply here, but if it tried to do 100000000 stores instead, I wouldn't expect the compiler to really figure that out. What I meant was that I think the memcpys are the correct representation of this from the frontend, it's just that I'd like more optimization to happen here.
>
> On Wed, May 17, 2017 at 11:48 AM, Peter Lawrence <peterl95124 at sbcglobal.net> wrote:
>> Keno,
>> "No, I very much want the memcpys there" seems like a contradiction; aren't you trying to optimize away the memcpys?
>>
>> Peter Lawrence
>>
>> On May 17, 2017, at 8:22 AM, Keno Fischer <keno at juliacomputing.com> wrote:
>>
>> On Wed, May 17, 2017 at 12:09 AM, Peter Lawrence via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>> Keno,
>>> Perhaps you can view the problem to be the memcpys themselves. We humans can look at the memcpys and see loads and stores, but to almost all optimizer passes they aren't what the pass is looking for; the passes instead see function calls, which they mostly don't touch.
>>>
>>> If these memcpys were inlined into plain old loads and stores, the redundant loads and stores should be deleted by existing opts.
>>>
>>> A question I have for you: since this looks like "copy-in-copy-out" argument semantics, which to me looks more like Ada than C, what was the source language?
>>>
>>> Peter Lawrence.
>>
>> No, I very much want the memcpys there. With individual stores I'd give up hope that the optimizer can figure out what's going on here, esp. if it gets beyond a few bytes, but with memcpys it does seem doable. As for which frontend produced this, we're considering adding language semantics that would produce lots of code like this to julia, so we're looking into getting the optimizer to fold the extra copies away.
>>
>> Keno
Keno Fischer via llvm-dev
2017-May-18 15:47 UTC
[llvm-dev] Which pass should be propagating memory copies
Hi Peter,

thank you for your concern and advice. Since we both write the compiler and design the language, we are not particularly bound by any pre-existing spec. The concerns about multi-threaded data races are very relevant, of course, and we're well aware of the implications. In the particular case where this comes up, the language semantics generally guarantee that this is unobservable in both single-threaded and multi-threaded contexts (though we generally do allow the user to shoot themselves in the foot if they want to; the primary concern here is not really observability, but what the programmer expects from the semantics of the language). For what it's worth, this isn't exactly CICO. Our calling convention is generally by reference. However, we do have notions of semantic immutability, which is where this particular pattern arises (in cases where a new immutable gets created by taking an existing one and modifying a field in one place only). Because of these semantic guarantees, we know that there's no aliasing of the kind that would be problematic (and we expose this information to LLVM through the various AA mechanisms). Now, similar issues of course arise with mutable memory locations as well. However, in such cases the data race would be explicitly present in the source program, so we don't have a problem with the compiler making this optimization. FWIW, our multi-threading programming model is in the early stages, and we're considering various language-level constraints on concurrent data modification to mostly disallow that situation unless explicitly opted in to by the user, but that's a bit off.

From my perspective, I don't see a reason why GVN shouldn't be doing this (which is why I sent the original email in the first place). It would of course be very possible for us to write our own pass that pattern-matches this and performs the transformation that we want. However, we generally prefer to work with the community to put optimizations in the best possible place, so that others may automatically take advantage of them. It sounds like the community consensus is that GVN should be able to do this kind of optimization (and thanks to Daniel for providing some guidance on the implementation!). If people feel strongly that memcpyopt (or a new pass) with appropriate pattern matching would be a better place, I'd be happy to go that way as well, of course.

Keno

On Wed, May 17, 2017 at 1:28 PM, Peter Lawrence <peterl95124 at sbcglobal.net> wrote:
> [full quote of the previous message elided]
Peter Lawrence via llvm-dev
2017-May-19 15:46 UTC
[llvm-dev] Which pass should be propagating memory copies
Keno,
      I suspect that if you, Daniel B., and I were to have an in-person meeting, it would take 5~10 minutes for everyone to (in terms D.B. will appreciate) "converge to a fixed point" (:-)! of understanding. Meanwhile we are stuck with limited email bandwidth, but a meet-up at the next LLVM Bay Area social might be a good idea.

Whether GVN is an appropriate place for this opt is going to hinge on the precise details of the calling convention, which you are still being vague about. You are simultaneously saying

1) the memcpys are necessary
2) the memcpys can and should be optimized away

but these two statements are mutually exclusive, so you need to be more precise about the calling convention.

Why is this important? Because even if Daniel B. can enhance GVN to "look through" the memcpys, optimizing loads from the local stack copy into loads through the original pointer argument, and stores into the local stack copy into stores through the original pointer argument, there is still the issue of deleting the memcpys themselves, which is the actual performance problem. And the rules of the C/C++ programming language aren't typically going to allow these deletions: for example, if the original pointer argument is passed to another function, or if the address of any or all of the local stack copy is passed to another function, or simply if *any* function is called (since by the C/C++ rules it could modify the original data), the memcpys must be preserved. The same logical arguments apply to the loads and stores that Daniel B. thinks he can optimize in GVN; it depends on where they occur relative to calls to other functions within this function.

The only thing that allows the deletion of the memcpys is intimate knowledge of the Julia-specific calling convention, and again, similar conclusions apply even to the loads and stores. That, IMHO, is inappropriate to include in GVN, which is otherwise a purely C/C++ optimizer, so a separate Julia calling-convention pass is indicated.

PS: Don't be intimidated by writing an IR-to-IR pass; I've already written one, and they are easy. Yours will be particularly easy (after verifying the transform is legal), since it is just a "replace all uses of" (which already exists), deleting the memcpys, and finally deleting the stack object.

Peter Lawrence

> On May 18, 2017, at 8:47 AM, Keno Fischer <keno at juliacomputing.com> wrote:
> [full quote of the previous message elided]