thr3ads.net - llvm dev - [LLVMdev] [RFC][PATCH][OPENCL] synchronization scopes redux [Jan 2015]


Sahasrabuddhe, Sameer

2015-Jan-06 08:31 UTC

[LLVMdev] [RFC][PATCH][OPENCL] synchronization scopes redux

On 1/6/2015 1:01 PM, Chandler Carruth wrote:
> On Mon, Jan 5, 2015 at 10:51 PM, Owen Anderson <resistor at mac.com> wrote:
>
>     Hi Sameer,
>
>     > On Jan 5, 2015, at 4:51 AM, Sahasrabuddhe, Sameer
>     <Sameer.Sahasrabuddhe at amd.com> wrote:
>     >
>     > Right. The second version of my patches fixes the bitcode
>     encoding. But now I see another potential problem with future
>     bitcode if we require an ordering on the scopes. What happens when
>     a backend later introduces a new scope that goes into the middle
>     of the order? If they renumber the scopes to accommodate this, then
>     existing bitcode for that backend will no longer work. The bitcode
>     reader/writer cannot compensate for this since the values are
>     backend-specific. If we agree that this problem is real, then we
>     cannot force an ordering on the scope numbers.
>
>     That’s an interesting consideration, and something I hadn’t
>     thought of.  I’m unsure offhand of how much it matters in
>     practice.  The alternative, I suppose, is having something like
>     string-named scopes, but then we can’t do much with them at the IR
>     level.
>
>
> This has me somewhat non-plussed as well.
That really depends on what we want to do at the IR level. Scopes do not 
affect transformations that move non-atomic accesses around atomic 
accesses. The scope on the atomic access should not matter to the 
non-atomic accesses. The interesting case is when the compiler tries to 
optimize atomic accesses with respect to each other, and their scopes do 
not match. But it might be sufficient to leave such transformations to 
the target, since quite possibly, other target-specific information 
might be necessary to make them work or even to say whether they are 
beneficial.
>
>     > So far, I have refrained from proposing a keyword for cross
>     thread scope in the text format, because (a) there never was one
>     and (b) it is not strictly needed since it is the default anyway.
>     I am fine either way, but we will first have to decide what the
>     new keyword should be. I find "allthreads" to be a decent
>     counterpart for "singlethread" ... "crossthread" is not good
>     enough since intermediate scopes have multiple threads too.
>
>     This actually raises another question.  In principle, the “most
>     visible” scope ought to be something like “system” or “device”,
>     meaning a completely uncached memory access that is visible to all
>     peripherals in a heterogeneous system.  However, this is almost
>     certainly not what we want to have for typical memory accesses.
>
>     To summarize, a prototypical scope nest, from most to least
>     visible (aka least to most cacheable) might look like:
>
>     System  -->  AllThreads  -->  Various target-specific local scopes
>     -->  SingleThread
>
>     If we wanted to go really gonzo, there could be a Network scope at
>     the beginning for large-scale HPC systems, but I’m not sure how
>     important that is to anyone.
>
>
> I probably *should* be in a position to be very interested in such a 
> concept.... but honestly, I'm not. If I ever wanted to do something 
> like this, I would just define the large-scale HPC system as the
> "system" and a single machine/node as some "local" scope.
I agree. The most accurate description of the highest scope is "address 
space scope", i.e., all threads that can access the address space being 
accessed. From that view, it does not matter if the threads are local, 
remote or situated on different devices, or such. It makes sense to not 
specify any keyword for this scope, and just say that "synchscope(0)" is
default and need not be specified. Any other scope is an explicit
optimization over a narrower set of threads.
>     As a related question, do we actually need the local scopes to be
>     target specific?  Are there systems, real or planned, that
>     *aren’t* captured by:
>
>     [Network --> ] System  -->  AllThreads  -->  ThreadGroup  -->  SingleThread ?
>
>
> Sadly, I don't think this will work. In particular, there are 
> real-world accelerators with multiple tiers of thread groups that are 
> visible in the cache hierarchy subsystem.
The HSAIL 1.0 provisional spec has the following scopes: workitem, 
wavefront, workgroup, component, system. A component is anything that 
supports the HSAIL instruction set and can execute commands dispatched 
to it. I am not an authority on this, but to me, it is conceivable that 
there could be other scopes later, analogous to things such as one "die"
or one "chip" or one "board" or one node in a cloud.
>
> I'm starting to think we might actually need to let the target define 
> acceptable strings for memory scopes and a strict weak ordering over 
> them.... That's really complex and heavy weight, but I'm not really
> confident that we're safe committing to something more limited. The 
> good side is that we can add the SWO-stuff lazily as needed...
>
> Dunno, thoughts?
Just the thought of using strings in the IR smells like over-design to 
me. Going back to the original point, are target-independent 
optimizations on scoped atomic operations really so attractive?

But while the topic is wide open, here's another possibly whacky 
approach: we let the scopes be integers, and add a "scope layout" string
similar to data-layout. The string encodes the ordering of the integers.
If it is empty, then simple numerical comparisons are sufficient. Else 
the string spells out the exact ordering to be used. Any known current 
target will be happy with the first option. If some target inserts an 
intermediate scope in the future, then that version switches from empty 
to a fully specified string. The best part is that we don't even need to 
do this right now, and only come up with a "scope layout" spec when we
really hit the problem for some future target.
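
To make the "scope layout" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the layout format (a comma-separated list of scope numbers, widest scope first), the function names, and the convention that with an empty layout a smaller number means a wider scope (chosen so that scope 0 stays the default, widest scope). None of this is an existing LLVM API.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical layout format: comma-separated scope numbers, widest first,
// e.g. "0,3,1,2". An empty layout means "use plain numeric order".
std::vector<unsigned> parseScopeLayout(const std::string &layout) {
  std::vector<unsigned> order;
  std::istringstream in(layout);
  std::string item;
  while (std::getline(in, item, ','))
    order.push_back(static_cast<unsigned>(std::stoul(item)));
  return order;
}

// Returns true if scope `outer` covers at least the threads of scope `inner`.
bool scopeIncludes(const std::string &layout, unsigned outer, unsigned inner) {
  if (layout.empty())
    return outer <= inner; // assumption: smaller number = wider scope
  const std::vector<unsigned> order = parseScopeLayout(layout);
  const auto o = std::find(order.begin(), order.end(), outer);
  const auto i = std::find(order.begin(), order.end(), inner);
  if (o == order.end() || i == order.end())
    return false; // unknown scope: answer conservatively
  return o <= i;  // earlier position in the layout = wider scope
}
```

With this sketch, a target that later inserts a new scope 3 between scopes 0 and 1 would switch its layout from "" to "0,3,1,2". Scopes 0, 1 and 2 keep their numbers, so bitcode written before the change still means the same thing.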

Sameer.


Chandler Carruth

2015-Jan-07 03:29 UTC


On Tue, Jan 6, 2015 at 12:31 AM, Sahasrabuddhe, Sameer <
sameer.sahasrabuddhe at amd.com> wrote:
> I'm starting to think we might actually need to let the target define
> acceptable strings for memory scopes and a strict weak ordering over
> them.... That's really complex and heavy weight, but I'm not really
> confident that we're safe committing to something more limited. The good
> side is that we can add the SWO-stuff lazily as needed...
>
>  Dunno, thoughts?
>
>
> Just the thought of using strings in the IR smells like over-design to me.
> Going back to the original point, are target-independent optimizations on
> scoped atomic operations really so attractive?

Essentially, I think target-independent optimizations are still attractive,
but we might want to just force them to go through an actual
target-implemented API to interpret the scopes rather than making the
interpretation work from first principles. I just worry that the targets
are going to be too different and we may fail to accurately predict future
targets' needs.

I think the "strings" can be made relatively clean.

What I'm imagining is something very much like the target-specific
attributes which are just strings and left to the target to interpret, but
are cleanly factored so that the strings are wrapped up in a nice opaque
attribute that is used as the sigil everywhere in the IR. We could do this
with metadata, and technically this fits the model of metadata if we make
the interpretation of the absence of metadata be "system". However, I'm
quite hesitant to rely on metadata here as it hasn't always ended up
working so well for us. ;]

I'd be interested in your thoughts and others' thoughts on how we might
encode an opaque string-based scope effectively. If we can find a
reasonably clean way of doing it, it seems like the best approach at this
point:

- It ensures we have no bitcode stability problems.
- It makes it easy to define a small number of IR-specified values like
system/crossthread/allthreads/whatever and singlethread, and doing so isn't
ever awkward due to any kind of baked-in ordering.
- In practice in the real world, every target is probably going to just
take this and map it to an enum that clearly spells out the rank for their
target, so I suspect it won't actually increase the complexity of things
much.
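
As a rough sketch of the last point: a target could map opaque scope strings to an enum whose numeric value is the rank. The scope names below are the HSAIL ones mentioned earlier in this thread; the enum, the function names, and the choice to treat an empty string as the widest scope are assumptions made for illustration, not an existing LLVM interface.

```cpp
#include <string>

// Hypothetical per-target enum; a larger value means a more visible scope.
enum class ToyScope : int {
  WorkItem = 0,
  Wavefront,
  WorkGroup,
  Component,
  System
};

// Maps a scope string to the target's enum. Returns false for names the
// target does not know. Assumption: an empty name means the default,
// widest scope.
bool parseToyScope(const std::string &name, ToyScope &out) {
  if (name.empty() || name == "system") { out = ToyScope::System; return true; }
  if (name == "component") { out = ToyScope::Component; return true; }
  if (name == "workgroup") { out = ToyScope::WorkGroup; return true; }
  if (name == "wavefront") { out = ToyScope::Wavefront; return true; }
  if (name == "workitem")  { out = ToyScope::WorkItem;  return true; }
  return false;
}

// Rank comparison: does `outer` cover at least the threads of `inner`?
bool toyScopeIncludes(ToyScope outer, ToyScope inner) {
  return static_cast<int>(outer) >= static_cast<int>(inner);
}
```

Because the IR would carry only the string, the target is free to add an enumerator in the middle later without invalidating existing bitcode.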

>
>
> But while the topic is wide open, here's another possibly whacky approach:
> we let the scopes be integers, and add a "scope layout" string similar to
> data-layout. The string encodes the ordering of the integers. If it is
> empty, then simple numerical comparisons are sufficient. Else the string
> spells out the exact ordering to be used. Any known current target will be
> happy with the first option. If some target inserts an intermediate scope
> in the future, then that version switches from empty to a fully specified
> string. The best part is that we don't even need to do this right now, and
> only come up with a "scope layout" spec when we really hit the problem for
> some future target.

This isn't a bad approach, but it seems even more complex. I think I'd
rather go with the fairly boring one where the IR just encodes enough data
for the target to answer queries about the relationship between scopes.

So, my current leaning is to try to figure out a reasonably clean way to
use strings, similar to the target-specific attributes.

Sahasrabuddhe, Sameer

2015-Jan-07 04:06 UTC


On 1/7/2015 8:59 AM, Chandler Carruth wrote:
> Essentially, I think target-independent optimizations are still 
> attractive, but we might want to just force them to go through an 
> actual target-implemented API to interpret the scopes rather than 
> making the interpretation work from first principles. I just worry 
> that the targets are going to be too different and we may fail to 
> accurately predict future targets' needs.
If we have a target-implemented API, then just opaque numbers should 
also be sufficient, right? For the API, all we care about is queries 
that interesting optimizations will want answered from the target. This 
could be at the instruction level: "is it okay to remove this atomic 
store with scope n1 that is immediately followed by atomic store with 
scope n2?". Or it could be at the scope level: "does scope n2 include 
scope n1"?
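
The two kinds of query could look roughly like the sketch below. The interface and class names are hypothetical (this is not an existing LLVM class), and the scope numbers of the toy target are arbitrary. They are deliberately not in rank order, to show that the IR would not need to interpret them.

```cpp
#include <cstdint>

// Hypothetical target-implemented interface over opaque scope numbers.
class ScopeOracle {
public:
  virtual ~ScopeOracle() = default;

  // Scope-level query: does scope `outer` include scope `inner`?
  virtual bool includes(std::uint32_t outer, std::uint32_t inner) const = 0;

  // Instruction-level query: may an atomic store with scope `first` be
  // removed when it is immediately followed by an atomic store with scope
  // `second` to the same location? Conservative default: never.
  virtual bool canRemoveOverwrittenStore(std::uint32_t first,
                                         std::uint32_t second) const {
    (void)first;
    (void)second;
    return false;
  }
};

// Toy target with three nested scopes.
class ToyTargetOracle final : public ScopeOracle {
public:
  bool includes(std::uint32_t outer, std::uint32_t inner) const override {
    const int ro = rank(outer);
    const int ri = rank(inner);
    return ro >= 0 && ri >= 0 && ro >= ri; // unknown scopes: say "no"
  }

private:
  // Maps an opaque scope number to a rank; -1 means "not a known scope".
  static int rank(std::uint32_t scope) {
    switch (scope) {
    case 0: return 2;  // widest (default) scope
    case 7: return 1;  // an intermediate scope
    case 4: return 0;  // narrowest scope
    default: return -1;
    }
  }
};
```

A target-independent pass would only ever call the interface. Whether the instruction-level query can safely return true for any pair of scopes depends on the target's memory model, which is why the default here refuses.
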
> I think the "strings" can be made relatively clean.
>
> What I'm imagining is something very much like the target-specific 
> attributes which are just strings and left to the target to interpret, 
> but are cleanly factored so that the strings are wrapped up in a nice 
> opaque attribute that is used as the sigil everywhere in the IR. We 
> could do this with metadata, and technically this fits the model of 
> metadata if we make the interpretation of the absence of metadata be
> "system". However, I'm quite hesitant to rely on metadata here as it
> hasn't always ended up working so well for us. ;]
Metadata was the first thing to be considered internally at AMD. But it 
was quickly shot down because the Research guys were unwilling to accept 
the possibility of scope being lost and replaced by a default "system"
scope. Current models are useful only when all atomic accesses for a 
given location use the same scope throughout the application, i.e., all 
threads running on all agents. So it is not okay for the compiler to 
"promote" the scope in just one kernel unless it has access to the 
entire application; the result is undefined. This is true for OpenCL 
source as well as HSAIL target. This may change in the near future:

HRF-Relaxed: Adapting HRF to the complexities of industrial 
heterogeneous memory models
http://benedictgaster.org/?page_id=278

But even then, it will be difficult to say if the same models can be 
applied to heterogeneous systems that don't resemble OpenCL or HSAIL.
> I'd be interested in your thoughts and others' thoughts on how we
> might encode an opaque string-based scope effectively. If we can find 
> a reasonably clean way of doing it, it seems like the best approach at 
> this point:
>
> - It ensures we have no bitcode stability problems.
> - It makes it easy to define a small number of IR-specified values 
> like system/crossthread/allthreads/whatever and singlethread, and 
> doing so isn't ever awkward due to any kind of baked-in ordering.
> - In practice in the real world, every target is probably going to 
> just take this and map it to an enum that clearly spells out the rank 
> for their target, so I suspect it won't actually increase the 
> complexity of things much.
I seem to be missing something here about the need for strings. If they 
are opaque anyway, and they are represented by sigils, then the sigils 
themselves are all that matter, right? Then the encoding is just a number...
>     But while the topic is wide open, here's another possibly whacky
>     approach: we let the scopes be integers, and add a "scope layout"
>     string similar to data-layout. The string encodes the ordering of
>     the integers. If it is empty, then simple numerical comparisons
>     are sufficient. Else the string spells out the exact ordering to
>     be used. Any known current target will be happy with the first
>     option. If some target inserts an intermediate scope in the
>     future, then that version switches from empty to a fully specified
>     string. The best part is that we don't even need to do this right
>     now, and only come up with a "scope layout" spec when we really
>     hit the problem for some future target.
>
>
> This isn't a bad approach, but it seems even more complex. I think I'd
> rather go with the fairly boring one where the IR just encodes enough
> data for the target to answer queries about the relationship between 
> scopes.
I am not really championing scope layout strings over a 
target-implemented API, but it seems less work to me rather than more. 
The relationship between scopes is just an SWO, and it can be 
represented as a graph. A practical target will have a very small number 
of scopes, say not more than 16. It should be possible to encode this 
into a graphviz-style string. Then instead of having every target 
implement an API, they just have to specify the relationship as a string.
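
A sketch of what such a string could look like and how it might be consumed. The format is invented for illustration: semicolon-separated edges of the form "outer>inner". The function names are hypothetical as well. Inclusion is then just reachability in the graph.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <sstream>
#include <vector>

// Adjacency list: each scope maps to the scopes it directly includes.
using ScopeGraph = std::map<std::string, std::vector<std::string>>;

// Parses a hypothetical spec such as
// "system>component;component>workgroup".
ScopeGraph parseScopeGraph(const std::string &spec) {
  ScopeGraph graph;
  std::istringstream in(spec);
  std::string edge;
  while (std::getline(in, edge, ';')) {
    const std::size_t arrow = edge.find('>');
    if (arrow == std::string::npos)
      continue; // skip malformed entries
    graph[edge.substr(0, arrow)].push_back(edge.substr(arrow + 1));
  }
  return graph;
}

// True if `inner` is reachable from `outer` (every scope includes itself).
bool graphIncludes(const ScopeGraph &graph, const std::string &outer,
                   const std::string &inner) {
  std::set<std::string> seen;
  std::vector<std::string> work{outer};
  while (!work.empty()) {
    const std::string cur = work.back();
    work.pop_back();
    if (cur == inner)
      return true;
    if (!seen.insert(cur).second)
      continue; // already visited
    const auto it = graph.find(cur);
    if (it != graph.end())
      for (const std::string &next : it->second)
        work.push_back(next);
  }
  return false;
}
```

With at most 16 or so scopes the search cost is negligible, and scopes that are not ordered relative to each other (two siblings under the same parent, say) simply have no path between them.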

Sameer.


Owen Anderson

2015-Jan-07 16:38 UTC


> On Jan 6, 2015, at 2:31 AM, Sahasrabuddhe, Sameer
> <Sameer.Sahasrabuddhe at amd.com> wrote:
>
>> I probably *should* be in a position to be very interested in such a
>> concept.... but honestly, I'm not. If I ever wanted to do something like
>> this, I would just define the large-scale HPC system as the "system"
>> and a single machine/node as some "local" scope.
>
> I agree. The most accurate description of the highest scope is
> "address space scope", i.e., all threads that can access the address
> space being accessed. From that view, it does not matter if the threads are
> local, remote or situated on different devices, or such. It makes sense to not
> specify any keyword for this scope, and just say that "synchscope(0)"
> is default and need not be specified. Any other scope is an explicit
> optimization over a narrower set of threads.

I want to point out that “address space” is *not* sufficient for the highest
scope.  It’s entirely possible to have a host and an accelerator that do not
have shared address spaces, but do need to communicate, particularly in job
management code where the accelerator talks directly to the host-side driver.

—Owen
