Sahasrabuddhe, Sameer
2014-Nov-19 17:54 UTC
[LLVMdev] memory scopes in atomic instructions
On 11/19/2014 4:05 AM, Chandler Carruth wrote:> > On Fri, Nov 14, 2014 at 1:09 PM, Sahasrabuddhe, Sameer > <sameer.sahasrabuddhe at amd.com <mailto:sameer.sahasrabuddhe at amd.com>> > wrote: > > 1. Update the synchronization scope field in atomic instructions > from a > single bit to a wider field, say 32-bit unsigned integer. > > > I think this should be an arbitrary bit width integer. I think baking > any size into this is a mistake unless that size is "1".I noticed that the LRM never specifies a width for address spaces, but the implementation uses "unsigned" everywhere, which is clearly not an arbitrary width integer. Is this how memory scopes should also be implemented?> 4. The use of "single thread scope" is not clear. > > > Consider trying to read from memory written in a thread from a signal > handler delivered to that thread. Essentially, there may be a need to > write code which we know will execute in a single hardware thread, but > where the compiler optimizations precluded by atomics need to be > precluded as the control flow within the hardware thread may > arbitrarily move from one sequence of instructions to another. > > If it is required in > target-independent transforms, > > > Yes, it is. sig_atomic_t.Thanks! This also explains why SingleThread is baked into tsan. I couldn't find a way to work around __tsan_atomic_signal_fence if I removed SingleThread as a well-known memory scope.> > 5. Possibly add the following constraint on memory scopes: "The scope > represented by a larger value is nested inside (is a proper subset > of) the scope represented by a smaller value." This would also > imply > that the value used for single-thread scope must be the largest > value used by the target. > This constraint on "nesting" is easily satisfied by HSAIL (and also > OpenCL), where synchronization scopes increase from a single > work-item to the entire system. But it is conceivable that other > targets do not have this constraint. For example, a platform may > define synchronization scopes in terms of overlapping sets instead > of proper subsets. > > > I think this is the important thing to settle on in the design. I'd > really like to hear from a diverse set of vendors and folks operating > in the GPU space to understand whether having this constraint is > critically important or problematic for any reasons.I think "heterogenous systems" (in general, and not just HSA) might be a better term since it covers more than just GPU devices. Also, I don't see why this constraint in the general LLVM IR could be critically important to some target. But I can see why it could be problematic for a target! If I understand this correctly, the main issue is that if we do not build nested scopes into the IR, then we can never have target-independent optimizations that work with multiple memory scopes. Is that correct? Is that really so important? What happens when we do have a target that does not have nested memory scopes? Will it not be harder to remove this assumption from the target-independent optimizations?> I think (unfortunately) it would be hard to add this later...I am not sure I understand this part. The only effect I see is that targets might use enumerations that do not follow a strict order in their list of memory scopes. We can always encourage a future-looking convention to list the memory scopes in nesting order. And in the worst case, the enumerations can be reordered when the need arises, right? Sameer.
Sahasrabuddhe, Sameer
2014-Nov-21 03:31 UTC
[LLVMdev] memory scopes in atomic instructions
On 11/19/2014 11:24 PM, Sahasrabuddhe, Sameer wrote:>> 5. Possibly add the following constraint on memory scopes: "The >> scope >> represented by a larger value is nested inside (is a proper >> subset >> of) the scope represented by a smaller value." This would also >> imply >> that the value used for single-thread scope must be the largest >> value used by the target. >> This constraint on "nesting" is easily satisfied by HSAIL (and >> also >> OpenCL), where synchronization scopes increase from a single >> work-item to the entire system. But it is conceivable that other >> targets do not have this constraint. For example, a platform may >> define synchronization scopes in terms of overlapping sets >> instead >> of proper subsets. >> >> I think this is the important thing to settle on in the design. I'd >> really like to hear from a diverse set of vendors and folks operating >> in the GPU space to understand whether having this constraint is >> critically important or problematic for any reasons. > > I think "heterogenous systems" (in general, and not just HSA) might be > a better term since it covers more than just GPU devices.I asked around inside AMD. The feedback that I got is that assuming nested scopes for GPUs is reasonable, but the concern about future targets is valid on general principles. I still feel that it is a bit premature to build this assumption into LLVM ... we can at least wait until we encounter the first analysis or transformation that needs to depend on nested scopes! Note that this only affects the comparison of atomic accesses to each other, and that too, only when their scopes do not match. It does not affect how non-atomic accesses are optimized with respect to atomic accesses. Meanwhile, I will start working on the required changes. The only decision required right now is whether to use "1" for SingleThread, or the largest integer value. Sameer.
Sahasrabuddhe, Sameer
2014-Nov-24 09:54 UTC
[LLVMdev] memory scopes in atomic instructions
On 11/21/2014 9:01 AM, Sahasrabuddhe, Sameer wrote:> Meanwhile, I will start working on the required changes. The only > decision required right now is whether to use "1" for SingleThread, or > the largest integer value.I am currently trying to understand how the bitcode reader and writer work, and it will certainly take some time. Besides, that the SDNode class needs to be updated to use a separate unsigned field for the synchronization scope. Meanwhile, here's a patch that shows the immediate effect of allowing more than two scopes. There seem to be only two users for now --- the X86 target and tsan. The patch attempts to fix them in slightly different ways: 1. The X86 target understands only two scopes --- SingleThread or CrossThread (to be renamed later). All other "not SingleThread" scopes will be treated as CrossThread. I am not sure if this needs an assertion instead ... other scopes should not occur in the program anyway! Interestingly, this sort of "scope promotion" is supposed to be okay for OpenCL and will not introduce any new data races. The frontend can do this explicitly when translating OpenCL programs to targets with just two scopes. 2. tsan has a similar problem, but it should definitely assert for unknown scopes. If it attempted to promote them instead, then it will hide data races that it was supposed to find! The assertion can be replaced if/when tsan learns how to model other scopes for a given target. Sameer. -------------- next part -------------- commit 847121c1ba4b49f459a7f72c7527b82435b377f9 Author: Sameer Sahasrabuddhe <sameer.sahasrabuddhe at amd.com> Date: Mon Nov 24 12:03:21 2014 +0530 replace check for "CrossThread" with "not SingleThread" diff --git a/lib/Target/X86/X86ISelLowering.cpp b/lib/Target/X86/X86ISelLowering.cpp index 47c8ce0..022a6e2 100644 --- a/lib/Target/X86/X86ISelLowering.cpp +++ b/lib/Target/X86/X86ISelLowering.cpp @@ -18736,7 +18736,7 @@ static SDValue LowerATOMIC_FENCE(SDValue Op, const X86Subtarget *Subtarget, // The only fence that needs an instruction is a sequentially-consistent // cross-thread fence. - if (FenceOrdering == SequentiallyConsistent && FenceScope == CrossThread) { + if (FenceOrdering == SequentiallyConsistent && FenceScope != SingleThread) { if (hasMFENCE(*Subtarget)) return DAG.getNode(X86ISD::MFENCE, dl, MVT::Other, Op.getOperand(0)); diff --git a/lib/Transforms/Instrumentation/ThreadSanitizer.cpp b/lib/Transforms/Instrumentation/ThreadSanitizer.cpp index 417f2a1..f4cd8bd 100644 --- a/lib/Transforms/Instrumentation/ThreadSanitizer.cpp +++ b/lib/Transforms/Instrumentation/ThreadSanitizer.cpp @@ -298,9 +298,9 @@ void ThreadSanitizer::chooseInstructionsToInstrument( static bool isAtomic(Instruction *I) { if (LoadInst *LI = dyn_cast<LoadInst>(I)) - return LI->isAtomic() && LI->getSynchScope() == CrossThread; + return LI->isAtomic() && LI->getSynchScope() != SingleThread; if (StoreInst *SI = dyn_cast<StoreInst>(I)) - return SI->isAtomic() && SI->getSynchScope() == CrossThread; + return SI->isAtomic() && SI->getSynchScope() != SingleThread; if (isa<AtomicRMWInst>(I)) return true; if (isa<AtomicCmpXchgInst>(I)) @@ -539,8 +539,14 @@ bool ThreadSanitizer::instrumentAtomic(Instruction *I) { I->eraseFromParent(); } else if (FenceInst *FI = dyn_cast<FenceInst>(I)) { Value *Args[] = {createOrdering(&IRB, FI->getOrdering())}; - Function *F = FI->getSynchScope() == SingleThread ? - TsanAtomicSignalFence : TsanAtomicThreadFence; + unsigned SynchScope = FI->getSynchScope(); + Function *F = NULL; + switch (SynchScope) { + case SingleThread: F = TsanAtomicSignalFence; break; + case CrossThread: F = TsanAtomicThreadFence; break; + default: + llvm_unreachable("unsupported synchronization scope"); + } CallInst *C = CallInst::Create(F, Args); ReplaceInstWithInst(I, C); }