Dan Gohman <dan433584 at gmail.com> writes:

> But I don't understand why defining this as not being a data race
> would complicate things. I'm assuming the mask values are
> statically known. Can you explain a bit more?
>
> It's an interesting question for autovectorization, for example.
>
> Thread A:
>   for (i=0;i<n;++i)
>     if (i&1)
>       X[i] = 0;
>
> Thread B:
>   for (i=0;i<n;++i)
>     if (!(i&1))
>       X[i] = 1;
>
> The threads run concurrently without synchronization. As written,
> there is no race.

There is no race *if* the hardware cache coherence says so. :) There are
false sharing issues here, and different machines have behaved very
differently in the past.

The result entirely depends on the machine's consistency model.

LLVM is a virtual machine and the IR should define a consistency model.
Everything flows from that. I think ideally we'd define the model such
that there is no race in the scalar code and the compiler would be free
to vectorize it. This is a very strict consistency model, and for
targets with relaxed semantics, LLVM would have to insert
synchronization operations or choose not to vectorize.

Presumably, if the scalar code were problematic on a machine with
relaxed consistency, the user would have added synchronization
primitives and vectorization would not be possible.

> Can you vectorize either of these loops? If masked-out elements of a
> predicated store are "in play" for racing, then vectorizing would
> introduce a race. And, it'd be hard for an optimizer to prove that
> this doesn't happen.

Same answer. I don't think scalar vs. vector matters. This is mostly a
cache coherence issue.

There is one twist that our vectorization guy pointed out to me. If,
when vectorizing, threads A and B each read the entire vector, update
the values under mask, and then write the entire vector back, clearly a
data race will be introduced.
The Cray compiler has switches for users to balance safety and
performance, since a stride-one load and store is generally much faster
than a masked load and store.

So for vectorization, the answer is, "it depends on the target
consistency model and the style of vectorization chosen."

> p.s. Yes, you could also vectorize these with a strided store or a
> scatter, but then it raises a different question, of the memory
> semantics for strided or scatter stores.

And again, the same answer. :) I'm no vectorization expert, but I
believe what I said is correct. :)

                               -David
On Thu, May 9, 2013 at 6:04 PM, <dag at cray.com> wrote:

> Dan Gohman <dan433584 at gmail.com> writes:
>
> > But I don't understand why defining this as not being a data race
> > would complicate things. I'm assuming the mask values are
> > statically known. Can you explain a bit more?
> >
> > It's an interesting question for autovectorization, for example.
> >
> > Thread A:
> >   for (i=0;i<n;++i)
> >     if (i&1)
> >       X[i] = 0;
> >
> > Thread B:
> >   for (i=0;i<n;++i)
> >     if (!(i&1))
> >       X[i] = 1;
> >
> > The threads run concurrently without synchronization. As written,
> > there is no race.
>
> There is no race *if* the hardware cache coherence says so. :) There
> are false sharing issues here and different machines have behaved very
> differently in the past.

Let's not conflate races with false sharing. They're totally different,
and false sharing is *not* what we're discussing here.

> The result entirely depends on the machine's consistency model.
>
> LLVM is a virtual machine and the IR should define a consistency model.
> Everything flows from that. I think ideally we'd define the model such
> that there is no race in the scalar code and the compiler would be free
> to vectorize it. This is a very strict consistency model and for
> targets with relaxed semantics, LLVM would have to insert
> synchronization operations or choose not to vectorize.

LLVM already has a memory model. We don't need to add one. ;] It's here
for reference: http://llvm.org/docs/LangRef.html#memmodel

Also, cache coherency is *not* the right way to think of a memory model.
It makes it extremely hard to understand and define what optimization
passes are allowed to do. I think LLVM's memory model does a very good
job of this for both scalar and vector code today. If you spot problems
with it, let's start a thread to address them. I suspect myself,
Jeffrey, and Owen will all be extremely interested in discussing any
such issues.
The only thing that isn't in the model that is relevant here is
something that isn't in LLVM today -- masked loads and stores. And that
was what inspired my original question. =D
Chandler Carruth <chandlerc at google.com> writes:

> > There is no race *if* the hardware cache coherence says so. :) There
> > are false sharing issues here and different machines have behaved
> > very differently in the past.
>
> Let's not conflate races with false sharing. They're totally
> different, and false sharing is *not* what we're discussing here.

But in the real world false sharing exists and the compiler has to deal
with it. We can say, "make codegen deal with it," but these issues
bubble up to the target-independent optimizer nonetheless. A theoretical
memory model is good to have, but it's often not sufficient.

> > The result entirely depends on the machine's consistency model.
> >
> > LLVM is a virtual machine and the IR should define a consistency
> > model. Everything flows from that. I think ideally we'd define the
> > model such that there is no race in the scalar code and the compiler
> > would be free to vectorize it. This is a very strict consistency
> > model and for targets with relaxed semantics, LLVM would have to
> > insert synchronization operations or choose not to vectorize.
>
> LLVM already has a memory model. We don't need to add one. ;] It's
> here for reference: http://llvm.org/docs/LangRef.html#memmodel

I started to look at http://llvm.org/docs/Atomics.html first for a
gentler introduction and immediately spotted a problem. Your first
example, precluding register promotion for the update of x, is hugely
pessimistic. I don't particularly care, because our optimizer has
already done the transformation before we hit LLVM. :) But with that
restriction you're leaving a ton of performance on the table.

The same goes for vector code generation, in general. Our vectorizer
has already done it. But let's get this right for everyone.

> The only thing that isn't in the model that is relevant here is
> something that isn't in LLVM today -- masked loads and stores. And
> that was what inspired my original question.
FWIW, informally, the Cray compiler ignores any concurrency it did not
itself create. It won't generally introduce loads and stores that
weren't there, but it will certainly eliminate any loads and stores it
can. We do have atomic operations which generally behave like the LLVM
atomics. The memory model looks a lot like the C abstract machine. We
generally give the compiler free rein.

We let the Cray compiler do some unsafe optimization from time to time.
Turning a masked load/operation/masked store into a full
load/blend/full store is a common case. Users can disable it if they
want to be extra careful. We worry about false sharing, but only after
a certain point in translation. These have proven to be very practical
and effective techniques.

I wrote about masked stores vs. full stores in a previous message. I
believe a masked store should write only to unmasked elements, and it
should not trap on masked elements. If a developer needs something more
flexible for performance, he or she can do an unsafe transformation,
knowing the implications of doing so.

                               -David