On Wed, Aug 31, 2016 at 12:23:34PM -0700, Justin Lebar via llvm-dev wrote:
> > Some optimizations that are related to a single thread could be done without needing to know the actual memory scope.
>
> Right, it's clear to me that there exist optimizations that you cannot
> do if we model these ops as target-specific intrinsics.
>
> But what I think Mehdi and I were trying to get at is: How much of a
> problem is this in practice?  Are there real-world programs that
> suffer because we miss these optimizations?  If so, how much?
>
> The reason I'm asking this question is, there's a real cost to adding
> complexity in LLVM.  Everyone in the project is going to pay that
> cost, forever (or at least, until we remove the feature :).  So I want
> to try to evaluate whether paying that cost is actually worthwhile,
> as compared to the simple alternative (i.e., intrinsics).  Given the
> tepid response to this proposal, I'm sort of thinking that now may not
> be the time to start paying this cost.  (We can always revisit this in
> the future.)  But I remain open to being convinced.
>

I think the cost of adding this information to the IR is really low.
There is already a synchronization scope field present on LLVM atomic
instructions, and it is already encoded as 32 bits, so it is possible
to represent the additional scopes using the existing bitcode format.
Optimization passes are already aware of this synchronization scope
field, so they know how to preserve it when transforming the IR.

The primary goal here is to pass synchronization scope information from
the frontend to the backend.  We already have a mechanism for doing
this, so why not use it?  That seems like the lowest-cost option to me.

-Tom

> As a point of comparison, we have a rule of thumb that we'll add an
> optimization that increases compilation time by x% if we have a
> benchmark that is sped up by at least x%.  Similarly here, I'd want to
> weigh the added complexity against the improvements to user code.
>
> -Justin
>
> On Tue, Aug 23, 2016 at 2:28 PM, Tye, Tony via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> >> Since the scope is “opaque” and target specific, can you elaborate what
> >> kind of generic optimization can be performed?
> >
> > Some optimizations that are related to a single thread could be done without
> > needing to know the actual memory scope.  For example, an atomic acquire can
> > restrict reordering of memory operations after it, but allow reordering of
> > memory operations (except another atomic acquire) before it, regardless of
> > the memory scope.
> >
> > Thanks,
> > -Tony
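For concreteness, a minimal sketch in the textual IR of the time of the existing mechanism Tom refers to: atomic instructions already carry an optional synchronization scope next to their ordering, with the cross-thread (system) scope as the default and "singlethread" as the only other value today. Any scopes beyond these two are what the proposal would add; they are not existing syntax.

    ; default scope: ordered with respect to all other threads in the system
    %v0 = load atomic i32, i32* %p seq_cst, align 4

    ; single-thread scope: only ordered with respect to other code running
    ; in the same thread, e.g. signal handlers
    %v1 = load atomic i32, i32* %p singlethread seq_cst, align 4

Both forms are already carried through the optimizer and bitcode; the open question in this thread is what additional, target-defined scope values would mean to generic passes.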
On 09/01/2016 08:52 AM, Tom Stellard via llvm-dev wrote:
> On Wed, Aug 31, 2016 at 12:23:34PM -0700, Justin Lebar via llvm-dev wrote:
>>> Some optimizations that are related to a single thread could be done without needing to know the actual memory scope.
>> Right, it's clear to me that there exist optimizations that you cannot
>> do if we model these ops as target-specific intrinsics.
>>
>> But what I think Mehdi and I were trying to get at is: How much of a
>> problem is this in practice?  Are there real-world programs that
>> suffer because we miss these optimizations?  If so, how much?
>>
>> The reason I'm asking this question is, there's a real cost to adding
>> complexity in LLVM.  Everyone in the project is going to pay that
>> cost, forever (or at least, until we remove the feature :).  So I want
>> to try to evaluate whether paying that cost is actually worthwhile,
>> as compared to the simple alternative (i.e., intrinsics).  Given the
>> tepid response to this proposal, I'm sort of thinking that now may not
>> be the time to start paying this cost.  (We can always revisit this in
>> the future.)  But I remain open to being convinced.
>>
> I think the cost of adding this information to the IR is really low.
> There is already a synchronization scope field present on LLVM atomic
> instructions, and it is already encoded as 32 bits, so it is possible
> to represent the additional scopes using the existing bitcode format.
> Optimization passes are already aware of this synchronization scope
> field, so they know how to preserve it when transforming the IR.

I disagree with this assessment.  Atomics are an area where additional
complexity has a *substantial* conceptual cost.  I also question whether
the single_thread scope is actually respected throughout the optimizer
in practice.

I view the request to change the IR as a fairly big ask.  In particular,
I'm really nervous about what the exact optimization semantics of such
scopes would be.  Depending on how that was defined, this could be
anything from fairly straightforward to outright messy.  In particular,
if there are optimizations which are legal for only some subset of
scopes (or some subset of pairs of scopes?), I'd really like to see a
clear definition of how those are specified.

(p.s. Is there a current patch with an updated LangRef for the proposal
being discussed?  I've lost track of it.)

Let me give an example proposal just to illustrate my point.  This isn't
really a counter-proposal per se, just me thinking out loud.

Say we added 32 distinct concurrency domains.  One of them is used for
"single_thread".  One is used for "everything else".  The remaining 30
are defined in a target-specific manner, with the exception that they
can't overlap with each other or with the two predefined ones.  The
effect of a given atomic operation with respect to each concurrency
domain could be defined in terms of a 32-bit mask.  If a bit is set, the
operation is ordered (according to the separately stated ordering) with
respect to that domain.  If not, it is explicitly unordered w.r.t. that
domain.  A memory operation would be tagged with the memory domains with
which it might interact.

The key bit here is that I can describe transformations in terms of
these abstract domains without knowing anything about how the frontend
might be using such a domain or how the backend might lower it.
In particular, if I have the sequence:

    %v = load i64, i64* %p atomic scope {domain3 only}
    fence seq_cst scope={domain1 only}
    %v2 = load i64, i64* %p atomic scope {domain3 only}

I can tell that the two loads aren't ordered with respect to the fence
and that I can do load forwarding here.

In general, an IR extension needs to be well defined, general enough to
be used by multiple distinct users, and fairly battle-tested design-wise.
We're not completely afraid of having to remove bad ideas from the IR,
but we really try to avoid adding things until they're fairly proven.
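For contrast, staying with the same hypothetical domain-mask syntax as Philip's example above (this is his sketch, not existing IR), consider the case where the fence does name the loads' domain:

    %v  = load i64, i64* %p atomic scope {domain3 only}
    fence seq_cst scope={domain1, domain3}
    %v2 = load i64, i64* %p atomic scope {domain3 only}

Here the fence's domain mask intersects the loads' mask, so the loads are ordered with respect to the fence and the second load can no longer be blindly forwarded from the first.  The legality question reduces to a simple mask intersection test, independent of what the target means by each domain.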
> On Sep 2, 2016, at 5:52 PM, Philip Reames via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On 09/01/2016 08:52 AM, Tom Stellard via llvm-dev wrote:
>> On Wed, Aug 31, 2016 at 12:23:34PM -0700, Justin Lebar via llvm-dev wrote:
>>>> Some optimizations that are related to a single thread could be done without needing to know the actual memory scope.
>>> Right, it's clear to me that there exist optimizations that you cannot
>>> do if we model these ops as target-specific intrinsics.
>>>
>>> But what I think Mehdi and I were trying to get at is: How much of a
>>> problem is this in practice?  Are there real-world programs that
>>> suffer because we miss these optimizations?  If so, how much?
>>>
>>> The reason I'm asking this question is, there's a real cost to adding
>>> complexity in LLVM.  Everyone in the project is going to pay that
>>> cost, forever (or at least, until we remove the feature :).  So I want
>>> to try to evaluate whether paying that cost is actually worthwhile,
>>> as compared to the simple alternative (i.e., intrinsics).  Given the
>>> tepid response to this proposal, I'm sort of thinking that now may not
>>> be the time to start paying this cost.  (We can always revisit this in
>>> the future.)  But I remain open to being convinced.
>>>
>> I think the cost of adding this information to the IR is really low.
>> There is already a synchronization scope field present on LLVM atomic
>> instructions, and it is already encoded as 32 bits, so it is possible
>> to represent the additional scopes using the existing bitcode format.
>> Optimization passes are already aware of this synchronization scope
>> field, so they know how to preserve it when transforming the IR.
>
> I disagree with this assessment.  Atomics are an area where additional complexity has a *substantial* conceptual cost.  I also question whether the single_thread scope is actually respected throughout the optimizer in practice.
>
> I view the request to change the IR as a fairly big ask.  In particular, I'm really nervous about what the exact optimization semantics of such scopes would be.  Depending on how that was defined, this could be anything from fairly straightforward to outright messy.  In particular, if there are optimizations which are legal for only some subset of scopes (or some subset of pairs of scopes?), I'd really like to see a clear definition of how those are specified.
>
> (p.s. Is there a current patch with an updated LangRef for the proposal being discussed?  I've lost track of it.)

Here is the patch: https://reviews.llvm.org/D21723

> Let me give an example proposal just to illustrate my point.  This isn't really a counter-proposal per se, just me thinking out loud.
>
> Say we added 32 distinct concurrency domains.  One of them is used for "single_thread".  One is used for "everything else".  The remaining 30 are defined in a target-specific manner, with the exception that they can't overlap with each other or with the two predefined ones.  The effect of a given atomic operation with respect to each concurrency domain could be defined in terms of a 32-bit mask.  If a bit is set, the operation is ordered (according to the separately stated ordering) with respect to that domain.  If not, it is explicitly unordered w.r.t. that domain.  A memory operation would be tagged with the memory domains with which it might interact.
> The key bit here is that I can describe transformations in terms of these abstract domains without knowing anything about how the frontend might be using such a domain or how the backend might lower it.  In particular, if I have the sequence:
>
>     %v = load i64, i64* %p atomic scope {domain3 only}
>     fence seq_cst scope={domain1 only}
>     %v2 = load i64, i64* %p atomic scope {domain3 only}
>
> I can tell that the two loads aren't ordered with respect to the fence and that I can do load forwarding here.

I see the current proposal as a stripped-down version of what you
describe: the optimizer can reason about operations inside a single
scope, but can’t assume anything cross-scope (they may or may not
interact with each other).

What you describe seems like always having non-overlapping domains (from
the optimizer’s point of view), and requiring the frontend to express any
overlap by attaching a “list” of domains that an atomic operation
interacts with.

I hope I make sense :)

Best,

— Mehdi
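To make the contrast concrete, here is an illustrative sketch of the two encodings.  Both syntaxes are hypothetical, and the scope name "workgroup" is only an example of a target-defined value, not something taken from the patch.

    ; (a) one opaque, target-defined scope per operation, roughly what the
    ;     current proposal suggests: passes may reason within a single scope,
    ;     but must be conservative across any pair of distinct scopes.
    %v = load atomic i32, i32* %p syncscope("workgroup") acquire, align 4

    ; (b) an explicit domain list (mask) per operation, as in Philip's sketch:
    ;     overlap is visible to generic passes, so ordering questions reduce
    ;     to whether two operations' domain sets intersect.
    %w = load atomic i32, i32* %p scope {domain2, domain3} acquire, align 4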