Chandler Carruth <chandlerc at google.com> writes:

> What are the desired memory model semantics for a masked store?
> Specifically, let me suppose a simplified vector model of <2 x i64> on
> an i64-word-size platform.
>
> masked_store(<42, 42>, Ptr, <true, false>)
>
> Does this write to the entire <2 x i64> object stored at Ptr or not?

No. It writes one element.

> Put another way, consider:
>
> thread A:
> ...
> masked_store(<42, 42>, Ptr, <true, false>)
> ...
>
> thread B:
> ...
> masked_store(<42, 42>, Ptr, <false, true>)
> ...
>
> Assuming there is no specific synchronization relevant to Ptr between
> these two threads and their masked stores, does this form a data race
> or not?

It entirely depends on the hardware implementation. In most cases I
would say yes, due to cache coherence issues. On a purely theoretical
machine with no false sharing, there would be no data race.

Of course, this assumes that thread B won't access the element stored by
thread A and vice versa.

> From a memory model perspective, if this does *not* form a data race,
> that makes this tremendously more complex to implement, analyze, and
> optimize... I'm somewhat hopeful that the desired semantics are for
> this to form a data race (and thus require synchronization when
> occurring in different threads like this).

Most of the time the compiler will not know the mask value and will have
to be conservative. As Nadav has pointed out, what constitutes
"conservative" is entirely context-dependent.

But I don't understand why defining this as not being a data race would
complicate things. I'm assuming the mask values are statically known.
Can you explain a bit more?

                              -David
On Thu, May 9, 2013 at 7:47 AM, <dag at cray.com> wrote:

> But I don't understand why defining this as not being a data race would
> complicate things. I'm assuming the mask values are statically known.
> Can you explain a bit more?

It's an interesting question for autovectorization, for example.

Thread A:
for (i=0;i<n;++i)
  if (i&1)
    X[i] = 0;

Thread B:
for (i=0;i<n;++i)
  if (!(i&1))
    X[i] = 1;

The threads run concurrently without synchronization. As written,
there is no race.

Can you vectorize either of these loops? If masked-out elements of a
predicated store are "in play" for racing, then vectorizing would
introduce a race. And it'd be hard for an optimizer to prove that
this doesn't happen.

Dan

p.s. Yes, you could also vectorize these with a strided store or a
scatter, but then it raises a different question, of the memory
semantics for strided or scatter stores.
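[Editor's note: to make the vectorization concern concrete, here is a
minimal C sketch of Dan's example. The masked_store_2xi64 helper is a
hypothetical element-wise model of a <2 x i64> masked store, not a real
intrinsic or LLVM API: lanes whose mask bit is clear are never written,
so the vectorized loops touch exactly the same memory locations as the
scalar loops.]

#include <stddef.h>
#include <stdint.h>

/* Hypothetical element-wise model of a <2 x i64> masked store:
 * lanes whose mask bit is 0 are not written at all. */
static void masked_store_2xi64(int64_t *ptr, const int64_t val[2],
                               const int mask[2])
{
    for (int lane = 0; lane < 2; ++lane)
        if (mask[lane])
            ptr[lane] = val[lane];   /* only active lanes are stored */
}

/* Thread A's loop vectorized by 2: "if (i&1) X[i] = 0" selects the
 * odd index i+1 in each pair, so the mask is <false, true>. */
void thread_a_vectorized(int64_t *X, size_t n)
{
    const int64_t zeros[2] = {0, 0};
    const int mask_a[2] = {0, 1};
    for (size_t i = 0; i + 2 <= n; i += 2)
        masked_store_2xi64(&X[i], zeros, mask_a);
    /* scalar remainder for odd n omitted in this sketch */
}

/* Thread B's loop vectorized by 2: "if (!(i&1)) X[i] = 1" selects the
 * even index i in each pair, so the mask is <true, false>. */
void thread_b_vectorized(int64_t *X, size_t n)
{
    const int64_t ones[2] = {1, 1};
    const int mask_b[2] = {1, 0};
    for (size_t i = 0; i + 2 <= n; i += 2)
        masked_store_2xi64(&X[i], ones, mask_b);
    /* scalar remainder for odd n omitted in this sketch */
}

[Under this element-wise reading, running the two vectorized loops
concurrently is exactly as race-free as the scalar loops; the question
in this thread is whether the IR-level masked store may be modeled that
way.]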
Dan Gohman <dan433584 at gmail.com> writes:

> > But I don't understand why defining this as not being a data race
> > would complicate things. I'm assuming the mask values are
> > statically known. Can you explain a bit more?
>
> It's an interesting question for autovectorization, for example.
>
> Thread A:
> for (i=0;i<n;++i)
>   if (i&1)
>     X[i] = 0;
>
> Thread B:
> for (i=0;i<n;++i)
>   if (!(i&1))
>     X[i] = 1;
>
> The threads run concurrently without synchronization. As written,
> there is no race.

There is no race *if* the hardware cache coherence says so. :) There are
false sharing issues here, and different machines have behaved very
differently in the past. The result entirely depends on the machine's
consistency model.

LLVM is a virtual machine, and the IR should define a consistency model.
Everything flows from that.

I think ideally we'd define the model such that there is no race in the
scalar code and the compiler would be free to vectorize it. This is a
very strict consistency model, and for targets with relaxed semantics
LLVM would have to insert synchronization operations or choose not to
vectorize. Presumably, if the scalar code were problematic on a machine
with relaxed consistency, the user would have added synchronization
primitives and vectorization would not be possible.

> Can you vectorize either of these loops? If masked-out elements of a
> predicated store are "in play" for racing, then vectorizing would
> introduce a race. And it'd be hard for an optimizer to prove that
> this doesn't happen.

Same answer. I don't think scalar vs. vector matters. This is mostly a
cache coherence issue.

There is one twist that our vectorization guy pointed out to me. If,
when vectorizing, threads A and B each read the entire vector, update
the values under mask, and then write the entire vector back, clearly a
data race is introduced. The Cray compiler has switches that let users
balance safety and performance, since a stride-one load and store is
generally much faster than a masked load and store.

So for vectorization, the answer is, "it depends on the target
consistency model and the style of vectorization chosen."

> p.s. Yes, you could also vectorize these with a strided store or a
> scatter, but then it raises a different question, of the memory
> semantics for strided or scatter stores.

And again, the same answer. :)

I'm no vectorization expert, but I believe what I said is correct. :)

                              -David
On Thu, May 9, 2013 at 4:47 PM, <dag at cray.com> wrote:

> Chandler Carruth <chandlerc at google.com> writes:
>
> > What are the desired memory model semantics for a masked store?
> > Specifically, let me suppose a simplified vector model of <2 x i64> on
> > an i64-word-size platform.
> >
> > masked_store(<42, 42>, Ptr, <true, false>)
> >
> > Does this write to the entire <2 x i64> object stored at Ptr or not?
>
> No. It writes one element.

Is this a statement about all of the existing hardware that supports
masked stores, or about the desired semantics in your mind for the IR
model?

> > Put another way, consider:
> >
> > thread A:
> > ...
> > masked_store(<42, 42>, Ptr, <true, false>)
> > ...
> >
> > thread B:
> > ...
> > masked_store(<42, 42>, Ptr, <false, true>)
> > ...
> >
> > Assuming there is no specific synchronization relevant to Ptr between
> > these two threads and their masked stores, does this form a data race
> > or not?
>
> It entirely depends on the hardware implementation.

Well, in practice yes, but I'm asking how it should be modeled in the
IR. We aren't constrained to the option of having a *different* masked
store intrinsic for every hardware architecture that supports masked
stores, and it would seem strange not to look for a reasonable
target-independent abstraction which we can teach the middle-end
optimizers about (even if it does take the form of intrinsics). Maybe
there is no such abstraction? That in and of itself would be surprising
to me.

> In most cases I would say yes, due to cache coherence issues. On a
> purely theoretical machine with no false sharing, there would be no
> data race.

I think you're trying to reason about this from a hardware perspective,
and I'm trying to talk about what the right theoretical model for the
memory model is... While hardware is one constraint on that, there are
others as well, so it's worrisome to talk about both a theoretical
machine without false sharing and cache coherency issues when trying to
determine if two operations form a data race.

> Of course, this assumes that thread B won't access the element stored
> by thread A and vice versa.

If we assume that, we have the conclusion -- there is no data race. The
entire question is whether or not the masked element is notionally
"stored" (but without changing the value afterward), or not.

From Nadav's link, for example, it appears that AVX *does* actually do a
full store of the 256-bit vector, but it does it atomically, which
precludes data races.

> > From a memory model perspective, if this does *not* form a data race,
> > that makes this tremendously more complex to implement, analyze, and
> > optimize... I'm somewhat hopeful that the desired semantics are for
> > this to form a data race (and thus require synchronization when
> > occurring in different threads like this).
>
> Most of the time the compiler will not know the mask value and will have
> to be conservative. As Nadav has pointed out, what constitutes
> "conservative" is entirely context-dependent.
>
> But I don't understand why defining this as not being a data race would
> complicate things. I'm assuming the mask values are statically known.
> Can you explain a bit more?

If my example would form a data race, then when the optimizer sees such
a non-atomic store, *and doesn't know the mask statically* (as I agree,
that is the common case), it would know that the entire vector store was
independent from stores to the same memory locations in other threads.
It could even model the operation of a masked store as a load from that
address, masking both the loaded vector and the incoming vector
(literally), or-ing (or blending) them together, and storing the result
back out. This is a fairly simple model, and easy to reason about. I
would even suggest that perhaps this is how we should represent it in
IR.

However, if my example does not form a data race, then when the
optimizer sees such a non-atomic store, *and doesn't know the mask
statically*, it has to assume that the mask may dynamically preclude
storing to a memory location that is being concurrently accessed. It
cannot speculatively load the vector stored there, and perform an
explicit mask and blend and a full store. It essentially means that if
we see a masked store and don't know the mask, then even if we know the
address statically, that doesn't matter, because the mask could
effectively index into that address and select a single element to
store to.
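[Editor's note: for illustration, here is the load / blend / store model
described above, again as a C sketch rather than real IR or a real
target intrinsic (the function name is made up). Note that every lane is
written back, including lanes whose mask bit is clear, which is exactly
why this lowering is only valid if no other thread can touch the
masked-out lanes concurrently.]

#include <stdint.h>

/* Load / blend / store model of a <2 x i64> masked store (a sketch of
 * the representation discussed above, not any particular hardware).
 * The masked-out lane keeps its old value, but it is re-written, so a
 * concurrent store to that lane by another thread can be lost. */
static void masked_store_as_blend(int64_t *ptr, const int64_t val[2],
                                  const int mask[2])
{
    int64_t old[2];

    old[0] = ptr[0];                      /* speculative full load */
    old[1] = ptr[1];

    int64_t blended[2];
    for (int lane = 0; lane < 2; ++lane)  /* blend under the mask */
        blended[lane] = mask[lane] ? val[lane] : old[lane];

    ptr[0] = blended[0];                  /* full store: every lane is */
    ptr[1] = blended[1];                  /* written, masked or not    */
}

[If thread A and thread B from the earlier example both execute this
sequence on the same pair of elements, the loads and stores can
interleave so that one thread's element is overwritten with its old
value: the vectorized code has a race, and a lost update, that the
scalar code did not.]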
Chandler Carruth <chandlerc at google.com> writes:

> > > What are the desired memory model semantics for a masked store?
> > > Specifically, let me suppose a simplified vector model of <2 x i64>
> > > on an i64-word-size platform.
> > >
> > > masked_store(<42, 42>, Ptr, <true, false>)
> > >
> > > Does this write to the entire <2 x i64> object stored at Ptr or not?
> >
> > No. It writes one element.
>
> Is this a statement about all of the existing hardware that supports
> masked stores, or about the desired semantics in your mind for the IR
> model?

I made the comment thinking about hardware. If it were to write all
elements, what would it write? The old value? That could be useful in
some cases (the performance issue I mentioned below), but it also
presents problems for mem2reg/SSA.

This discussion might lead us to wanting a couple of flavors of masked
load and store. I'm not sure.

> > It entirely depends on the hardware implementation.
>
> Well, in practice yes, but I'm asking how it should be modeled in the
> IR. We aren't constrained to the option of having a *different* masked
> store intrinsic for every hardware architecture that supports masked
> stores, and it would seem strange not to look for a reasonable
> target-independent abstraction which we can teach the middle-end
> optimizers about (even if it does take the form of intrinsics). Maybe
> there is no such abstraction? That in and of itself would be surprising
> to me.

I agree. We should choose the semantics that gives the optimizer the
most freedom.

> From Nadav's link, for example, it appears that AVX *does* actually do
> a full store of the 256-bit vector, but it does it atomically, which
> precludes data races.

By "atomically," you mean that all elements are written before any other
operation is allowed to read or store to them?

> If my example would form a data race, then when the optimizer sees such
> a non-atomic store, *and doesn't know the mask statically* (as I agree,
> that is the common case), it would know that the entire vector store
> was independent from stores to the same memory locations in other
> threads. It could even model the operation of a masked store as a load
> from that address, masking both the loaded vector and the incoming
> vector (literally), or-ing (or blending) them together, and storing the
> result back out. This is a fairly simple model, and easy to reason
> about. I would even suggest that perhaps this is how we should
> represent it in IR.

Note that from a hardware perspective, the store may or may not cause a
data race depending on alignment and whether the store crosses a cache
line boundary.

> However, if my example does not form a data race, then when the
> optimizer sees such a non-atomic store, *and doesn't know the mask
> statically*, it has to assume that the mask may dynamically preclude
> storing to a memory location that is being concurrently accessed. It
> cannot speculatively load the vector stored there, and perform an
> explicit mask and blend and a full store. It essentially means that if
> we see a masked store and don't know the mask, then even if we know the
> address statically, that doesn't matter, because the mask could
> effectively index into that address and select a single element to
> store to.

You want to do a full vector load, update, and store. So do we, most of
the time, for performance. :)

I think you may be getting hung up a bit. What you describe in this
paragraph isn't a masked store at all. In fact, you explicitly state
it's "a full store."

Given your code snippet, if we assume there is no data race in the
scalar code *and* we assume that vector stores are atomic, then the
compiler has two choices for how to translate the code. It can be unsafe
and do the fast full load, merge, full store sequence, or it can do a
slower hardware masked store. The Cray compiler will do either,
depending on switches or directives given by the user. It is too
difficult to know statically which is the best choice. I believe that by
default we do the fast code, and the user can turn on switches to
generate safe code and see if that fixes problems. :)

I think we will need both options in LLVM. There's just no way to pick
one and always have the right answer.

I think we only need one masked store intrinsic. That intrinsic would
*not* write to masked elements. If the compiler wants to be entirely
safe, it should use that. Otherwise it should feel free to use the full
load/merge/full store sequence.

I will run this by our vector expert to see what he thinks.

                              -David
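[Editor's note: tying the two lowerings together, here is a hedged C
sketch of the choice David describes; the function name and the
lowering_info flag are illustrative only, not an actual LLVM interface.
The safe path follows the single masked-store intrinsic's semantics
(masked-out lanes never written); the fast full load / merge / full
store path is chosen only when the compiler has proven, or the user has
asserted via a switch or directive, that no other thread can touch the
masked-out lanes.]

#include <stdbool.h>
#include <stdint.h>

/* Illustrative summary of the lowering choice discussed above, for a
 * <2 x i64> predicated store. Nothing here is a real LLVM API. */
struct lowering_info {
    bool lanes_private;   /* proven by the compiler, or asserted by the
                             user: no other thread accesses the
                             masked-out lanes concurrently */
};

void lower_predicated_store(int64_t *ptr, const int64_t val[2],
                            const int mask[2], struct lowering_info info)
{
    if (info.lanes_private) {
        /* Fast: full vector load, merge under the mask, full store. */
        int64_t v0 = ptr[0], v1 = ptr[1];
        ptr[0] = mask[0] ? val[0] : v0;
        ptr[1] = mask[1] ? val[1] : v1;
    } else {
        /* Safe: masked-store semantics -- masked-out lanes are never
         * written, so no race can be introduced. */
        for (int lane = 0; lane < 2; ++lane)
            if (mask[lane])
                ptr[lane] = val[lane];
    }
}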