Dan Gohman <dan433584 at gmail.com> writes:

> But I don't understand why defining this as not being a data race
> would complicate things. I'm assuming the mask values are
> statically known. Can you explain a bit more?
>
> It's an interesting question for autovectorization, for example.
>
> Thread A:
>   for (i=0;i<n;++i)
>     if (i&1)
>       X[i] = 0;
>
> Thread B:
>   for (i=0;i<n;++i)
>     if (!(i&1))
>       X[i] = 1;
>
> The threads run concurrently without synchronization. As written,
> there is no race.

There is no race *if* the hardware cache coherence says so. :) There are
false sharing issues here, and different machines have behaved very
differently in the past.

The result entirely depends on the machine's consistency model.

LLVM is a virtual machine and the IR should define a consistency model.
Everything flows from that. I think ideally we'd define the model such
that there is no race in the scalar code and the compiler would be free
to vectorize it. This is a very strict consistency model, and for
targets with relaxed semantics, LLVM would have to insert
synchronization operations or choose not to vectorize.

Presumably, if the scalar code were problematic on a machine with
relaxed consistency, the user would have added synchronization
primitives and vectorization would not be possible.

> Can you vectorize either of these loops? If masked-out elements of a
> predicated store are "in play" for racing, then vectorizing would
> introduce a race. And, it'd be hard for an optimizer to prove that
> this doesn't happen.

Same answer. I don't think scalar vs. vector matters. This is mostly a
cache coherence issue.

There is one twist that our vectorization guy pointed out to me. If,
when vectorizing, threads A and B each read the entire vector, update
the values under mask, and then write the entire vector back, clearly a
data race will be introduced.
The Cray compiler has switches for users to balance safety and
performance, since a stride-one load and store is generally much faster
than a masked load and store.

So for vectorization, the answer is, "it depends on the target
consistency model and the style of vectorization chosen."

> p.s. Yes, you could also vectorize these with a strided store or a
> scatter, but then it raises a different question, of the memory
> semantics for strided or scatter stores.

And again, the same answer. :) I'm no vectorization expert, but I
believe what I said is correct. :)

                               -David
On Thu, May 9, 2013 at 6:04 PM, <dag at cray.com> wrote:

> Dan Gohman <dan433584 at gmail.com> writes:
>
> > But I don't understand why defining this as not being a data race
> > would complicate things. I'm assuming the mask values are
> > statically known. Can you explain a bit more?
> >
> > It's an interesting question for autovectorization, for example.
> >
> > Thread A:
> >   for (i=0;i<n;++i)
> >     if (i&1)
> >       X[i] = 0;
> >
> > Thread B:
> >   for (i=0;i<n;++i)
> >     if (!(i&1))
> >       X[i] = 1;
> >
> > The threads run concurrently without synchronization. As written,
> > there is no race.
>
> There is no race *if* the hardware cache coherence says so. :) There
> are false sharing issues here and different machines have behaved very
> differently in the past.

Let's not conflate races with false sharing. They're totally different,
and false sharing is *not* what we're discussing here.

> The result entirely depends on the machine's consistency model.
>
> LLVM is a virtual machine and the IR should define a consistency model.
> Everything flows from that. I think ideally we'd define the model such
> that there is no race in the scalar code and the compiler would be free
> to vectorize it. This is a very strict consistency model and for
> targets with relaxed semantics, LLVM would have to insert
> synchronization operations or choose not to vectorize.

LLVM already has a memory model. We don't need to add one. ;] It's here
for reference: http://llvm.org/docs/LangRef.html#memmodel

Also, cache coherency is *not* the right way to think of a memory model.
It makes it extremely hard to understand and define what optimization
passes are allowed to do. I think LLVM's memory model does a very good
job of this for both scalar and vector code today. If you spot problems
with it, let's start a thread to address them. I suspect myself,
Jeffrey, and Owen will all be extremely interested in discussing any
such issues.
The only thing that isn't in the model that is relevant here is
something that isn't in LLVM today -- masked loads and stores. And that
was what inspired my original question. =D
Chandler Carruth <chandlerc at google.com> writes:

> > There is no race *if* the hardware cache coherence says so. :) There
> > are false sharing issues here and different machines have behaved
> > very differently in the past.
>
> Let's not conflate races with false sharing. They're totally
> different, and false sharing is *not* what we're discussing here.

But in the real world false sharing exists and the compiler has to deal
with it. We can say, "make codegen deal with it," but these issues
bubble up to the target-independent optimizer nonetheless. A theoretical
memory model is good to have, but it's often not sufficient.

> > The result entirely depends on the machine's consistency model.
> >
> > LLVM is a virtual machine and the IR should define a consistency
> > model. Everything flows from that. I think ideally we'd define the
> > model such that there is no race in the scalar code and the compiler
> > would be free to vectorize it. This is a very strict consistency
> > model and for targets with relaxed semantics, LLVM would have to
> > insert synchronization operations or choose not to vectorize.
>
> LLVM already has a memory model. We don't need to add one. ;] It's
> here for reference: http://llvm.org/docs/LangRef.html#memmodel

I started to look at http://llvm.org/docs/Atomics.html first for a
gentler introduction and immediately spotted a problem. Your first
example, precluding register promotion for the update of x, is hugely
pessimistic. I don't particularly care, because our optimizer has
already done the transformation before we hit LLVM. :) But with that
restriction you're leaving a ton of performance on the table.

The same goes for vector code generation, in general. Our vectorizer
has already done it. But let's get this right for everyone.

> The only thing that isn't in the model that is relevant here is
> something that isn't in LLVM today -- masked loads and stores. And
> that was what inspired my original question.
FWIW, informally, the Cray compiler ignores any concurrency it did not
itself create. It won't generally introduce loads and stores that
weren't there, but it will certainly eliminate any loads and stores it
can. We do have atomic operations which generally behave like the LLVM
atomics. The memory model looks a lot like the C abstract machine. We
generally give the compiler free rein.

We let the Cray compiler do some unsafe optimization from time to time.
Turning a masked load/operation/masked store into a full
load/blend/full store is a common case. Users can disable it if they
want to be extra careful. We worry about false sharing, but only after
a certain point in translation. These have proven to be very practical
and effective techniques.

I wrote about masked stores vs. full stores in a previous message. I
believe a masked store should write only to unmasked elements, and it
should not trap on masked elements. If a developer needs something more
flexible for performance, he or she can do an unsafe transformation,
knowing the implications of doing so.

                               -David