Chandler Carruth
2007-Jul-12 18:08 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On 7/12/07, Dan Gohman <djg at cray.com> wrote:
> On Thu, Jul 12, 2007 at 10:06:04AM -0500, David Greene wrote:
> > On Thursday 12 July 2007 07:23, Torvald Riegel wrote:
> > > > The single instruction constraints can, at their most flexible,
> > > > constrain any set of possible pairings of loads from memory and
> > > > stores to memory
> > >
> > > I'm not sure about this, but can we get issues due to "special"
> > > kinds of data transfers (such as vector stuff, DMA, ...)? Memcpy
> > > implementations could be one thing to look at. This kind of breaks
> > > down to how universal you want the memory model to be.
> >
> > Right. For example, the Cray X1 has a much richer set of memory
> > ordering instructions than anything on the commodity micros:
> >
> > http://tinyurl.com/3agjjn
> >
> > The memory ordering intrinsics in the current llvm proposal can't take
> > advantage of them because they are too coarse-grained.
>
> I guess the descriptions on that page are, heh, a little terse ;-).

A bit. ;] I was glad to see your clarification.

> The Cray X1 has a dimension of synchronization that isn't covered in
> this proposal, and that's the set of observers that need to observe the
> ordering. For example you can synchronize a team of streams in a
> multi-streaming processor without requiring that the ordering of memory
> operations be observed by the entire system. That's what motivates most
> of the variety in that list.

This is fascinating to me, personally. I don't know how reasonable it is
to implement directly in LLVM, but could a codegen for the X1 in theory
establish whether the "shared memory" was part of a stream in a
multi-streaming processor, and use those local synchronization routines?
I'm not sure how reasonable this is. Alternatively, to target an
architecture this specific, perhaps the LLVM code could be annotated to
show where it is operating on streams versus across processors, and allow
that to guide the codegen decision as to which type of synchronization to
utilize. As LLVM doesn't really understand the parallel implementation
the code is running on, it seems like it might be impossible to build
this into LLVM without it being X1-type-system specific... but perhaps
you have better ideas on how to do such things from working on it for
some time?

> There's one other specific aspect I'd like to point out here. There's an
> "acquire" which orders prior *scalar* loads with *all* subsequent memory
> accesses, and a "release" which orders *all* prior accesses with
> subsequent *scalar* stores. The Cray X1's interest in distinguishing
> scalar accesses from vector accesses is specific to its architecture,
> but in general, it is another case that motivates having more
> granularity than just "all loads" and "all stores".

This clarifies some of those instructions. Here is my thought on how to
fit this behavior in with the current proposal:

You're still ordering load-store pairings; there is just the added
dimensionality of types. This seems like an easy extension to the
existing proposal: combine the load and store pairings with a type
dimension to achieve finer-grained control. Does this make sense as an
incremental step, from your end with much more experience comparing your
hardware to LLVM's IR?
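Purely as a strawman -- the intrinsic name and the scalar/vector encoding
below are invented for illustration, not anything from the v2 proposal --
that extra dimension might be bolted onto the pairing constraints roughly
like so:

    ; Hypothetical sketch only.  The four i1 flags stand in for the
    ; proposal's load/store pairing constraints (ll, ls, sl, ss); the
    ; two i8 arguments narrow each side of the pairing to a class of
    ; access, say 1 = scalar, 2 = vector, 3 = either.
    declare void @llvm.atomic.barrier.classed(i1 %ll, i1 %ls, i1 %sl,
                                              i1 %ss, i8 %prior, i8 %post)

    ; The X1-style "acquire" -- prior *scalar* loads ordered against all
    ; subsequent memory accesses -- might then be spelled:
    call void @llvm.atomic.barrier.classed(i1 true, i1 true, i1 false,
                                           i1 false, i8 1, i8 3)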
> Overall though, I'm quite happy to see that the newest revision of the
> proposal has switched from LLVM instructions to LLVM intrinsics. That
> will make it easier to experiment with extensions in the future. And
> having the string "atomic" right there in the names of each operation
> is very much appreciated :-).

The "atomic" in the names is nice. It does make the syntax a bit less
elegant, but it'll get the ball rolling faster, and that's far more
important!

Thanks for the input, and I really love the X1 as an example of a
radically different memory model from the architectures LLVM is currently
targeting.

-Chandler Carruth

> Dan
> --
> Dan Gohman, Cray Inc.
David Greene
2007-Jul-12 21:51 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On Thursday 12 July 2007 13:08, Chandler Carruth wrote:
> > > Right. For example, the Cray X1 has a much richer set of memory
> > > ordering instructions than anything on the commodity micros:
> > >
> > > http://tinyurl.com/3agjjn
> > >
> > > The memory ordering intrinsics in the current llvm proposal can't
> > > take advantage of them because they are too coarse-grained.
> >
> > I guess the descriptions on that page are, heh, a little terse ;-).
>
> A bit. ;] I was glad to see your clarification.

Yeah, sorry. I was heading out the door to a meeting when I posted the
link. I'm glad Dan clarified some things. Unfortunately, I could only
link to publicly-available documents. Our internal ISA book explains this
all much better. :)

> > The Cray X1 has a dimension of synchronization that isn't covered in
> > this proposal, and that's the set of observers that need to observe
> > the ordering. For example you can synchronize a team of streams in a
> > multi-streaming processor without requiring that the ordering of
> > memory operations be observed by the entire system. That's what
> > motivates most of the variety in that list.
>
> This is fascinating to me, personally. I don't know how reasonable it
> is to implement directly in LLVM, but could a codegen for the X1 in
> theory establish whether the "shared memory" was part of a stream in a
> multi-streaming processor, and use those local synchronization
> routines?

Absolutely. The X1 compiler is responsible for partitioning loops to run
on multiple streams and synchronizing among the streams as necessary.
That synchronization is at a level "above" general system memory
ordering.

The X1 has multiple levels of parallelism:

- Vectorization
- Decoupled vector/scalar execution (this is where the lsyncs come in)
- Multistreaming (the msync operations)
- Multiprocessing (global machine-wide synchronization via gsync)

The compiler is basically responsible for the first three levels, while
the user does the fourth via MPI, OpenMP, CAF, UPC, etc. In general the
user sometimes inserts directives to help the compiler with the first
three, but the compiler gets a lot of cases on its own automatically.

> I'm not sure how reasonable this is. Alternatively, to target an
> architecture this specific, perhaps the LLVM code could be annotated to
> show where it is operating on streams versus across processors, and
> allow that to guide the codegen decision as to which type of
> synchronization to utilize. As LLVM doesn't really understand the
> parallel implementation the code is running on, it seems like it might
> be impossible to build this into LLVM without it being X1-type-system
> specific... but perhaps you have better ideas on how to do such things
> from working on it for some time?

In a parallelizing compiler, the compiler must keep track of where it
placed data when it parallelized the code, since it must know how to
handle dependencies and insert synchronizations. In the case of the X1,
the compiler partitioned a loop to run on multiple cores, so it knows to
use msyncs when that code accesses data shared among the cores. The
compiler determined which data to share among the cores and which to keep
private in each core. Similarly, when it vectorizes, it knows the
dependencies between vector and scalar operations and inserts the
necessary lsyncs.

PGI, Pathscale and Intel, for example, are starting to talk about
automatic OpenMP. They will need to insert synchronizations across cores
similarly to what's done on the X1. Those will probably be some form of
MFENCE.
The abstraction here is the "level" of parallelism. Vectorization is very
fine-grained; most implementations in hardware do not need explicit
software syncs between scalar and vector code. The next level up is
multithreading (we call that multistreaming on the X1 for historical
reasons). Depending on architecture, this could happen within a single
core (MTA style) or across multiple cores (X1 style), providing two
distinct levels of parallelism and possibly two distinct sets of sync
instructions in the general case. Then you've got a level of parallelism
around the set of sockets that are cache coherent (so-called "local"
processors, or a "node" in X1 parlance). You might have another set of
sync instructions for this (the X1 does not). Then you have the most
general case of parallelism across "the system," where communication time
between processors is extremely long. This is the "gsync" level on the
X1. Other, more sophisticated architectures may have even more levels of
parallelism.

So in thinking about extending your work (which again, I want to stress
is not immediately necessary, but still good to think about), I would
suggest we think in terms of level of parallelization, or perhaps
"distance among participants." It's not a good idea to hard-code things
like "vector-scalar sync," but I can imagine intrinsics that say "order
memory among these participants," or "order memory at this level," or
"order memory between these levels," where the levels are defined by the
target architecture. If a target doesn't have as many levels as used in
the llvm code, then it can just choose to use a more expensive sync
instruction. In X1 terms, a gsync is a really big hammer, but it can
always be used in place of an lsync.
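Just to make the shape of that concrete -- the intrinsic and the level
numbering below are invented for illustration, not anything from the
proposal -- it might look something like:

    ; Hypothetical sketch only.  The four i1 flags stand in for the
    ; proposal's load/store pairing constraints; %level says how far the
    ; ordering must be visible: 0 = within one stream/thread, 1 = among
    ; the streams of one multi-streaming processor (msync territory on
    ; the X1), 2 = a cache-coherent node, 3 = the whole machine (gsync
    ; on the X1).  A target that lacks a given level just falls back to
    ; its next, more expensive sync instruction.
    declare void @llvm.memory.barrier.level(i1 %ll, i1 %ls, i1 %sl,
                                            i1 %ss, i32 %level)

    ; e.g. order prior stores against subsequent loads, visible only
    ; among the streams of one multi-streaming processor:
    call void @llvm.memory.barrier.level(i1 false, i1 false, i1 true,
                                         i1 false, i32 1)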
I don't know if any plans exist to incorporate parallelizing
transformations into llvm, but I can certainly imagine building an
auto-parallelizing infrastructure above it. That infrastructure would
have to communicate information down to llvm so it could generate code
properly. How to do that is an entirely different can of worms. :)

> > There's one other specific aspect I'd like to point out here. There's
> > an "acquire" which orders prior *scalar* loads with *all* subsequent
> > memory accesses, and a "release" which orders *all* prior accesses
> > with subsequent *scalar* stores. The Cray X1's interest in
> > distinguishing scalar accesses from vector accesses is specific to
> > its architecture, but in general, it is another case that motivates
> > having more granularity than just "all loads" and "all stores".
>
> This clarifies some of those instructions. Here is my thought on how to
> fit this behavior in with the current proposal:
>
> You're still ordering load-store pairings; there is just the added
> dimensionality of types. This seems like an easy extension to the
> existing proposal: combine the load and store pairings with a type
> dimension to achieve finer-grained control. Does this make sense as an
> incremental step, from your end with much more experience comparing
> your hardware to LLVM's IR?

This would work for X1-style lsyncs, but we should think about whether it
is too architecture-specific. Decoupled execution doesn't fit completely
snugly into the "levels of parallelism" model I outlined above, so it's a
bit of an oddball. It's parallelism, but of a different form. Commodity
micros have decoupled execution, but they handle the syncs in hardware
(thus moving to/from a GPR and an XMM register is expensive).

The X1 fsync falls into the same category. It's there because the X1 does
not have precise traps for floating point code, and it doesn't really
have anything to do with parallelization. Ditto isync (all modern
processors have some form of this to guard against self-modifying code).

The bottom line is that I don't have easy cut-and-dried answers. I
suspect this will be an organic process, and we'll learn how to abstract
these things in an architecture-independent manner as we go.

-Dave
David A. Greene
2007-Jul-13 02:07 UTC
[LLVMdev] Atomic Operation and Synchronization Proposal v2
On Thursday 12 July 2007 16:51, David Greene wrote:
> > You're still ordering load-store pairings; there is just the added
> > dimensionality of types. This seems like an easy extension to the
> > existing proposal: combine the load and store pairings with a type
> > dimension to achieve finer-grained control. Does this make sense as an
> > incremental step, from your end with much more experience comparing
> > your hardware to LLVM's IR?
>
> This would work for X1-style lsyncs

I take that back. Maybe. If by "type" you literally mean the data type of
the value (int, float, etc.) as well as the extent of the data (vector or
scalar), then it won't handle the X1 case where integer scalar
instructions feed floating point vector instructions, and similar
combinations. If by "type" you only mean the extent of the data, then it
would work fine.

-Dave
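P.S. A tiny invented example of the case I mean (the IR below is only a
sketch, not proposed syntax):

    ; A scalar *integer* load whose result feeds *floating point vector*
    ; work: the two sides don't even share a data type, which is exactly
    ; the combination I mean.  What the X1 needs ordered here is "prior
    ; scalar accesses before subsequent vector accesses"; the element
    ; types (i64 vs. double) are irrelevant, so extent-only granularity
    ; is enough.
    %n  = load i64, i64* %count                  ; scalar integer access
    %p  = getelementptr double, double* %base, i64 %n
    ; <ordering needed here: prior scalar accesses before subsequent
    ;  vector accesses>
    %vp = bitcast double* %p to <2 x double>*
    %v  = load <2 x double>, <2 x double>* %vp   ; vector FP access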