Chandler Carruth <chandlerc at google.com> writes:

> What are the desired memory model semantics for a masked store?
> Specifically, let me suppose a simplified vector model of <2 x i64> on
> an i64-word-size platform.
>
> masked_store(<42, 42>, Ptr, <true, false>)
>
> Does this write to the entire <2 x i64> object stored at Ptr or not?

No. It writes one element.

> Put another way, consider:
>
> thread A:
> ...
> masked_store(<42, 42>, Ptr, <true, false>)
> ...
>
> thread B:
> ...
> masked_store(<42, 42>, Ptr, <false, true>)
> ...
>
> Assuming there is no specific synchronization relevant to Ptr between
> these two threads and their masked stores, does this form a data race
> or not?

It entirely depends on the hardware implementation. In most cases I
would say yes, due to cache coherence issues. On a purely theoretical
machine with no false sharing, there would be no data race.

Of course, this assumes that thread B won't access the element stored by
thread A and vice versa.

> From a memory model perspective, if this does *not* form a data race,
> that makes this tremendously more complex to implement, analyze, and
> optimize... I'm somewhat hopeful that the desired semantics are for
> this to form a data race (and thus require synchronization when
> occurring in different threads like this).

Most of the time the compiler will not know the mask value and will have
to be conservative. As Nadav has pointed out, what constitutes
"conservative" is entirely context-dependent.

But I don't understand why defining this as not being a data race would
complicate things. I'm assuming the mask values are statically known.
Can you explain a bit more?

                              -David
On Thu, May 9, 2013 at 7:47 AM, <dag at cray.com> wrote:

> But I don't understand why defining this as not being a data race would
> complicate things. I'm assuming the mask values are statically known.
> Can you explain a bit more?

It's an interesting question for autovectorization, for example.

Thread A:
for (i=0;i<n;++i)
  if (i&1)
    X[i] = 0;

Thread B:
for (i=0;i<n;++i)
  if (!(i&1))
    X[i] = 1;

The threads run concurrently without synchronization. As written,
there is no race.

Can you vectorize either of these loops? If masked-out elements of a
predicated store are "in play" for racing, then vectorizing would
introduce a race. And it'd be hard for an optimizer to prove that
this doesn't happen.

Dan

p.s. Yes, you could also vectorize these with a strided store or a
scatter, but then it raises a different question, of the memory
semantics for strided or scatter stores.
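[Editor's note: to make the vectorization concern concrete, here is a
minimal C sketch of Dan's example. The masked_store_2xi64 helper is a
hypothetical element-wise model of a <2 x i64> masked store, not a real
intrinsic or LLVM API: lanes whose mask bit is clear are never written,
so the vectorized loops touch exactly the same memory locations as the
scalar loops.]

#include <stddef.h>
#include <stdint.h>

/* Hypothetical element-wise model of a <2 x i64> masked store:
 * lanes whose mask bit is 0 are not written at all. */
static void masked_store_2xi64(int64_t *ptr, const int64_t val[2],
                               const int mask[2])
{
    for (int lane = 0; lane < 2; ++lane)
        if (mask[lane])
            ptr[lane] = val[lane];   /* only active lanes are stored */
}

/* Thread A's loop vectorized by 2: "if (i&1) X[i] = 0" selects the
 * odd index i+1 in each pair, so the mask is <false, true>. */
void thread_a_vectorized(int64_t *X, size_t n)
{
    const int64_t zeros[2] = {0, 0};
    const int mask_a[2] = {0, 1};
    for (size_t i = 0; i + 2 <= n; i += 2)
        masked_store_2xi64(&X[i], zeros, mask_a);
    /* scalar remainder for odd n omitted in this sketch */
}

/* Thread B's loop vectorized by 2: "if (!(i&1)) X[i] = 1" selects the
 * even index i in each pair, so the mask is <true, false>. */
void thread_b_vectorized(int64_t *X, size_t n)
{
    const int64_t ones[2] = {1, 1};
    const int mask_b[2] = {1, 0};
    for (size_t i = 0; i + 2 <= n; i += 2)
        masked_store_2xi64(&X[i], ones, mask_b);
    /* scalar remainder for odd n omitted in this sketch */
}

[Under this element-wise reading, running the two vectorized loops
concurrently is exactly as race-free as the scalar loops; the question
in this thread is whether the IR-level masked store may be modeled that
way.]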
Dan Gohman <dan433584 at gmail.com> writes:

> > But I don't understand why defining this as not being a data race
> > would complicate things. I'm assuming the mask values are
> > statically known. Can you explain a bit more?
>
> It's an interesting question for autovectorization, for example.
>
> Thread A:
> for (i=0;i<n;++i)
>   if (i&1)
>     X[i] = 0;
>
> Thread B:
> for (i=0;i<n;++i)
>   if (!(i&1))
>     X[i] = 1;
>
> The threads run concurrently without synchronization. As written,
> there is no race.

There is no race *if* the hardware cache coherence says so. :) There are
false sharing issues here, and different machines have behaved very
differently in the past. The result entirely depends on the machine's
consistency model.

LLVM is a virtual machine, and the IR should define a consistency model.
Everything flows from that.

I think ideally we'd define the model such that there is no race in the
scalar code and the compiler would be free to vectorize it. This is a
very strict consistency model, and for targets with relaxed semantics
LLVM would have to insert synchronization operations or choose not to
vectorize. Presumably, if the scalar code were problematic on a machine
with relaxed consistency, the user would have added synchronization
primitives and vectorization would not be possible.

> Can you vectorize either of these loops? If masked-out elements of a
> predicated store are "in play" for racing, then vectorizing would
> introduce a race. And it'd be hard for an optimizer to prove that
> this doesn't happen.

Same answer. I don't think scalar vs. vector matters. This is mostly a
cache coherence issue.

There is one twist that our vectorization guy pointed out to me. If,
when vectorizing, threads A and B each read the entire vector, update
the values under mask, and then write the entire vector back, clearly a
data race is introduced. The Cray compiler has switches that let users
balance safety and performance, since a stride-one load and store is
generally much faster than a masked load and store.

So for vectorization, the answer is, "it depends on the target
consistency model and the style of vectorization chosen."

> p.s. Yes, you could also vectorize these with a strided store or a
> scatter, but then it raises a different question, of the memory
> semantics for strided or scatter stores.

And again, the same answer. :)

I'm no vectorization expert, but I believe what I said is correct. :)

                              -David
On Thu, May 9, 2013 at 4:47 PM, <dag at cray.com> wrote:

> Chandler Carruth <chandlerc at google.com> writes:
>
> > What are the desired memory model semantics for a masked store?
> > Specifically, let me suppose a simplified vector model of <2 x i64> on
> > an i64-word-size platform.
> >
> > masked_store(<42, 42>, Ptr, <true, false>)
> >
> > Does this write to the entire <2 x i64> object stored at Ptr or not?
>
> No. It writes one element.

Is this a statement about all of the existing hardware that supports
masked stores, or about the desired semantics in your mind for the IR
model?

> > Put another way, consider:
> >
> > thread A:
> > ...
> > masked_store(<42, 42>, Ptr, <true, false>)
> > ...
> >
> > thread B:
> > ...
> > masked_store(<42, 42>, Ptr, <false, true>)
> > ...
> >
> > Assuming there is no specific synchronization relevant to Ptr between
> > these two threads and their masked stores, does this form a data race
> > or not?
>
> It entirely depends on the hardware implementation.

Well, in practice yes, but I'm asking how it should be modeled in the
IR. We aren't constrained to the option of having a *different* masked
store intrinsic for every hardware architecture that supports masked
stores, and it would seem strange not to look for a reasonable
target-independent abstraction which we can teach the middle-end
optimizers about (even if it does take the form of intrinsics). Maybe
there is no such abstraction? That in and of itself would be surprising
to me.

> In most cases I would say yes, due to cache coherence issues. On a
> purely theoretical machine with no false sharing, there would be no
> data race.

I think you're trying to reason about this from a hardware perspective,
and I'm trying to talk about what the right theoretical model for the
memory model is... While hardware is one constraint on that, there are
others as well, so it's worrisome to talk about both a theoretical
machine without false sharing and cache coherency issues when trying to
determine if two operations form a data race.

> Of course, this assumes that thread B won't access the element stored
> by thread A and vice versa.

If we assume that, we have the conclusion -- there is no data race. The
entire question is whether or not the masked element is notionally
"stored" (but without changing the value afterward), or not.

From Nadav's link, for example, it appears that AVX *does* actually do a
full store of the 256-bit vector, but it does it atomically, which
precludes data races.

> > From a memory model perspective, if this does *not* form a data race,
> > that makes this tremendously more complex to implement, analyze, and
> > optimize... I'm somewhat hopeful that the desired semantics are for
> > this to form a data race (and thus require synchronization when
> > occurring in different threads like this).
>
> Most of the time the compiler will not know the mask value and will have
> to be conservative. As Nadav has pointed out, what constitutes
> "conservative" is entirely context-dependent.
>
> But I don't understand why defining this as not being a data race would
> complicate things. I'm assuming the mask values are statically known.
> Can you explain a bit more?

If my example would form a data race, then when the optimizer sees such
a non-atomic store, *and doesn't know the mask statically* (as I agree,
that is the common case), it would know that the entire vector store was
independent from stores to the same memory locations in other threads.
It could even model the operation of a masked store as a load from that
address, masking both the loaded vector and the incoming vector
(literally), or-ing (or blending) them together, and storing the result
back out. This is a fairly simple model, and easy to reason about. I
would even suggest that perhaps this is how we should represent it in
IR.

However, if my example does not form a data race, then when the
optimizer sees such a non-atomic store, *and doesn't know the mask
statically*, it has to assume that the mask may dynamically preclude
storing to a memory location that is being concurrently accessed. It
cannot speculatively load the vector stored there, and perform an
explicit mask and blend and a full store. It essentially means that if
we see a masked store and don't know the mask, then even if we know the
address statically, that doesn't matter, because the mask could
effectively index into that address and select a single element to
store to.
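[Editor's note: for illustration, here is the load / blend / store model
described above, again as a C sketch rather than real IR or a real
target intrinsic (the function name is made up). Note that every lane is
written back, including lanes whose mask bit is clear, which is exactly
why this lowering is only valid if no other thread can touch the
masked-out lanes concurrently.]

#include <stdint.h>

/* Load / blend / store model of a <2 x i64> masked store (a sketch of
 * the representation discussed above, not any particular hardware).
 * The masked-out lane keeps its old value, but it is re-written, so a
 * concurrent store to that lane by another thread can be lost. */
static void masked_store_as_blend(int64_t *ptr, const int64_t val[2],
                                  const int mask[2])
{
    int64_t old[2];

    old[0] = ptr[0];                      /* speculative full load */
    old[1] = ptr[1];

    int64_t blended[2];
    for (int lane = 0; lane < 2; ++lane)  /* blend under the mask */
        blended[lane] = mask[lane] ? val[lane] : old[lane];

    ptr[0] = blended[0];                  /* full store: every lane is */
    ptr[1] = blended[1];                  /* written, masked or not    */
}

[If thread A and thread B from the earlier example both execute this
sequence on the same pair of elements, the loads and stores can
interleave so that one thread's element is overwritten with its old
value: the vectorized code has a race, and a lost update, that the
scalar code did not.]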
Chandler Carruth <chandlerc at google.com> writes:

> > > What are the desired memory model semantics for a masked store?
> > > Specifically, let me suppose a simplified vector model of <2 x i64>
> > > on an i64-word-size platform.
> > >
> > > masked_store(<42, 42>, Ptr, <true, false>)
> > >
> > > Does this write to the entire <2 x i64> object stored at Ptr or not?
> >
> > No. It writes one element.
>
> Is this a statement about all of the existing hardware that supports
> masked stores, or about the desired semantics in your mind for the IR
> model?

I made the comment thinking about hardware. If it were to write all
elements, what would it write? The old value? That could be useful in
some cases (the performance issue I mentioned below), but it also
presents problems for mem2reg/SSA.

This discussion might lead us to wanting a couple of flavors of masked
load and store. I'm not sure.

> > It entirely depends on the hardware implementation.
>
> Well, in practice yes, but I'm asking how it should be modeled in the
> IR. We aren't constrained to the option of having a *different* masked
> store intrinsic for every hardware architecture that supports masked
> stores, and it would seem strange not to look for a reasonable
> target-independent abstraction which we can teach the middle-end
> optimizers about (even if it does take the form of intrinsics). Maybe
> there is no such abstraction? That in and of itself would be surprising
> to me.

I agree. We should choose the semantics that gives the optimizer the
most freedom.

> From Nadav's link, for example, it appears that AVX *does* actually do
> a full store of the 256-bit vector, but it does it atomically, which
> precludes data races.

By "atomically," you mean that all elements are written before any other
operation is allowed to read or store to them?

> If my example would form a data race, then when the optimizer sees such
> a non-atomic store, *and doesn't know the mask statically* (as I agree,
> that is the common case), it would know that the entire vector store
> was independent from stores to the same memory locations in other
> threads. It could even model the operation of a masked store as a load
> from that address, masking both the loaded vector and the incoming
> vector (literally), or-ing (or blending) them together, and storing the
> result back out. This is a fairly simple model, and easy to reason
> about. I would even suggest that perhaps this is how we should
> represent it in IR.

Note that from a hardware perspective, the store may or may not cause a
data race depending on alignment and whether the store crosses a cache
line boundary.

> However, if my example does not form a data race, then when the
> optimizer sees such a non-atomic store, *and doesn't know the mask
> statically*, it has to assume that the mask may dynamically preclude
> storing to a memory location that is being concurrently accessed. It
> cannot speculatively load the vector stored there, and perform an
> explicit mask and blend and a full store. It essentially means that if
> we see a masked store and don't know the mask, then even if we know the
> address statically, that doesn't matter, because the mask could
> effectively index into that address and select a single element to
> store to.

You want to do a full vector load, update, and store. So do we, most of
the time, for performance. :)

I think you may be getting hung up a bit. What you describe in this
paragraph isn't a masked store at all. In fact, you explicitly state
it's "a full store."

Given your code snippet, if we assume there is no data race in the
scalar code *and* we assume that vector stores are atomic, then the
compiler has two choices for how to translate the code. It can be unsafe
and do the fast full load, merge, full store sequence, or it can do a
slower hardware masked store. The Cray compiler will do either,
depending on switches or directives given by the user. It is too
difficult to know statically which is the best choice. I believe that by
default we do the fast code, and the user can turn on switches to
generate safe code and see if that fixes problems. :)

I think we will need both options in LLVM. There's just no way to pick
one and always have the right answer.

I think we only need one masked store intrinsic. That intrinsic would
*not* write to masked elements. If the compiler wants to be entirely
safe, it should use that. Otherwise it should feel free to use the full
load/merge/full store sequence.

I will run this by our vector expert to see what he thinks.

                              -David
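[Editor's note: tying the two lowerings together, here is a hedged C
sketch of the choice David describes; the function name and the
lowering_info flag are illustrative only, not an actual LLVM interface.
The safe path follows the single masked-store intrinsic's semantics
(masked-out lanes never written); the fast full load / merge / full
store path is chosen only when the compiler has proven, or the user has
asserted via a switch or directive, that no other thread can touch the
masked-out lanes.]

#include <stdbool.h>
#include <stdint.h>

/* Illustrative summary of the lowering choice discussed above, for a
 * <2 x i64> predicated store. Nothing here is a real LLVM API. */
struct lowering_info {
    bool lanes_private;   /* proven by the compiler, or asserted by the
                             user: no other thread accesses the
                             masked-out lanes concurrently */
};

void lower_predicated_store(int64_t *ptr, const int64_t val[2],
                            const int mask[2], struct lowering_info info)
{
    if (info.lanes_private) {
        /* Fast: full vector load, merge under the mask, full store. */
        int64_t v0 = ptr[0], v1 = ptr[1];
        ptr[0] = mask[0] ? val[0] : v0;
        ptr[1] = mask[1] ? val[1] : v1;
    } else {
        /* Safe: masked-store semantics -- masked-out lanes are never
         * written, so no race can be introduced. */
        for (int lane = 0; lane < 2; ++lane)
            if (mask[lane])
                ptr[lane] = val[lane];
    }
}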