Owen Anderson
2015-Jan-06  06:51 UTC
[LLVMdev] [RFC][PATCH][OPENCL] synchronization scopes redux
Hi Sameer,

> On Jan 5, 2015, at 4:51 AM, Sahasrabuddhe, Sameer <Sameer.Sahasrabuddhe at amd.com> wrote:
>
> Right. The second version of my patches fixes the bitcode encoding. But now I see another potential problem with future bitcode if we require an ordering on the scopes. What happens when a backend later introduces a new scope that goes into the middle of the order? If they renumber the scopes to accommodate this, then existing bitcode for that backend will no longer work. The bitcode reader/writer cannot compensate for this since the values are backend-specific. If we agree that this problem is real, then we cannot force an ordering on the scope numbers.

That’s an interesting consideration, and something I hadn’t thought of. I’m unsure offhand of how much it matters in practice. The alternative, I suppose, is having something like string-named scopes, but then we can’t do much with them at the IR level.

> So far, I have refrained from proposing a keyword for cross thread scope in the text format, because (a) there never was one and (b) it is not strictly needed since it is the default anyway. I am fine either way, but we will first have to decide what the new keyword should be. I find "allthreads" to be a decent counterpart for "singlethread" ... "crossthread" is not good enough since intermediate scopes have multiple threads too.

This actually raises another question. In principle, the “most visible” scope ought to be something like “system” or “device”, meaning a completely uncached memory access that is visible to all peripherals in a heterogeneous system. However, this is almost certainly not what we want to have for typical memory accesses.

To summarize, a prototypical scope nest, from most to least visible (aka least to most cacheable) might look like:

    System -> AllThreads -> Various target-specific local scopes -> SingleThread

If we wanted to go really gonzo, there could be a Network scope at the beginning for large-scale HPC systems, but I’m not sure how important that is to anyone.

As a related question, do we actually need the local scopes to be target specific? Are there systems, real or planned, that *aren’t* captured by:

    [Network -> ] System -> AllThreads -> ThreadGroup -> SingleThread ?

--Owen
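
The renumbering concern quoted above can be made concrete with a small sketch. Everything here is hypothetical: the enum names, the numbering (0 as the widest scope), and the helper functions are illustrative assumptions, not LLVM's actual encoding. It only shows that if ordering is derived from the numeric value, inserting a scope into the middle changes the meaning of values already stored in old bitcode.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical numbering (not LLVM's): 0 is the widest scope,
// larger values are progressively narrower.
enum class ScopeV1 : std::uint8_t {
  System = 0, AllThreads = 1, ThreadGroup = 2, SingleThread = 3
};

// A later revision of the same hypothetical backend inserts SubGroup
// between ThreadGroup and SingleThread, and renumbers to keep the order.
enum class ScopeV2 : std::uint8_t {
  System = 0, AllThreads = 1, ThreadGroup = 2, SubGroup = 3, SingleThread = 4
};

// Ordering by plain numeric comparison: true if `a` is at least as
// visible (wide) as `b`.
template <typename Scope>
bool isAtLeastAsVisible(Scope a, Scope b) {
  return static_cast<std::uint8_t>(a) <= static_cast<std::uint8_t>(b);
}

// Reads a raw scope value using the V2 numbering, as a V2 reader would
// when handed bitcode that was written with the V1 numbering.
inline ScopeV2 decodeWithV2(std::uint8_t raw) {
  assert(raw <= 4 && "value outside the hypothetical V2 range");
  return static_cast<ScopeV2>(raw);
}
```

In this sketch, a value written as `ScopeV1::SingleThread` (3) is read back by the renumbered reader as `ScopeV2::SubGroup`, which is the silent change in meaning that the bitcode reader cannot detect on its own.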
Chandler Carruth
2015-Jan-06  07:31 UTC
[LLVMdev] [RFC][PATCH][OPENCL] synchronization scopes redux
On Mon, Jan 5, 2015 at 10:51 PM, Owen Anderson <resistor at mac.com> wrote:

> Hi Sameer,
>
> > On Jan 5, 2015, at 4:51 AM, Sahasrabuddhe, Sameer <Sameer.Sahasrabuddhe at amd.com> wrote:
> >
> > Right. The second version of my patches fixes the bitcode encoding. But now I see another potential problem with future bitcode if we require an ordering on the scopes. What happens when a backend later introduces a new scope that goes into the middle of the order? If they renumber the scopes to accommodate this, then existing bitcode for that backend will no longer work. The bitcode reader/writer cannot compensate for this since the values are backend-specific. If we agree that this problem is real, then we cannot force an ordering on the scope numbers.
>
> That’s an interesting consideration, and something I hadn’t thought of. I’m unsure offhand of how much it matters in practice. The alternative, I suppose, is having something like string-named scopes, but then we can’t do much with them at the IR level.

This has me somewhat nonplussed as well.

> > So far, I have refrained from proposing a keyword for cross thread scope in the text format, because (a) there never was one and (b) it is not strictly needed since it is the default anyway. I am fine either way, but we will first have to decide what the new keyword should be. I find "allthreads" to be a decent counterpart for "singlethread" ... "crossthread" is not good enough since intermediate scopes have multiple threads too.
>
> This actually raises another question. In principle, the “most visible” scope ought to be something like “system” or “device”, meaning a completely uncached memory access that is visible to all peripherals in a heterogeneous system. However, this is almost certainly not what we want to have for typical memory accesses.
>
> To summarize, a prototypical scope nest, from most to least visible (aka least to most cacheable) might look like:
>
>     System -> AllThreads -> Various target-specific local scopes -> SingleThread
>
> If we wanted to go really gonzo, there could be a Network scope at the beginning for large-scale HPC systems, but I’m not sure how important that is to anyone.

I probably *should* be in a position to be very interested in such a concept.... but honestly, I'm not. If I ever wanted to do something like this, I would just define the large-scale HPC system as the "system" and a single machine/node as some "local" scope.

> As a related question, do we actually need the local scopes to be target specific? Are there systems, real or planned, that *aren’t* captured by:
>
>     [Network -> ] System -> AllThreads -> ThreadGroup -> SingleThread ?

Sadly, I don't think this will work. In particular, there are real-world accelerators with multiple tiers of thread groups that are visible in the cache hierarchy subsystem.

I'm starting to think we might actually need to let the target define acceptable strings for memory scopes and a strict weak ordering over them.... That's really complex and heavyweight, but I'm not really confident that we're safe committing to something more limited. The good side is that we can add the SWO-stuff lazily as needed...

Dunno, thoughts?
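
As a rough illustration of the idea of target-defined scope strings with a strict weak ordering, here is a minimal sketch. The class name `TargetScopeOrder`, the scope names, and the rank values are all hypothetical and exist only for this example; nothing like this interface is claimed to exist in LLVM. Each name maps to a rank, and "narrower than" is a comparison of ranks, which is irreflexive and transitive and so satisfies the strict weak ordering requirements.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical per-target table: scope name -> rank, where a larger
// rank means a narrower scope. Names with equal rank are equivalent.
class TargetScopeOrder {
public:
  explicit TargetScopeOrder(std::unordered_map<std::string, unsigned> ranks)
      : ranks_(std::move(ranks)) {}

  // True if this target accepts `name` as a memory scope.
  bool isKnown(const std::string& name) const {
    return ranks_.count(name) != 0;
  }

  // Strict "narrower than" relation between two scopes the target defines.
  bool isNarrower(const std::string& a, const std::string& b) const {
    assert(isKnown(a) && isKnown(b) && "scope name not defined by this target");
    return ranks_.at(a) > ranks_.at(b);
  }

private:
  std::unordered_map<std::string, unsigned> ranks_;
};

// An imaginary accelerator with several tiers of thread groups.
inline TargetScopeOrder makeExampleTarget() {
  std::unordered_map<std::string, unsigned> ranks = {
      {"system", 0u},  {"allthreads", 1u}, {"cluster", 2u},
      {"group", 3u},   {"subgroup", 4u},   {"singlethread", 5u}};
  return TargetScopeOrder(std::move(ranks));
}
```

Because the IR would carry only the name, a target could add a tier later by assigning it a rank between existing ones without changing what existing names mean. The cost is that generic passes would have to query the target for every comparison.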
Sahasrabuddhe, Sameer
2015-Jan-06  08:31 UTC
[LLVMdev] [RFC][PATCH][OPENCL] synchronization scopes redux
On 1/6/2015 1:01 PM, Chandler Carruth wrote:

> On Mon, Jan 5, 2015 at 10:51 PM, Owen Anderson <resistor at mac.com> wrote:
>
> > Hi Sameer,
> >
> > > On Jan 5, 2015, at 4:51 AM, Sahasrabuddhe, Sameer <Sameer.Sahasrabuddhe at amd.com> wrote:
> > >
> > > Right. The second version of my patches fixes the bitcode encoding. But now I see another potential problem with future bitcode if we require an ordering on the scopes. What happens when a backend later introduces a new scope that goes into the middle of the order? If they renumber the scopes to accommodate this, then existing bitcode for that backend will no longer work. The bitcode reader/writer cannot compensate for this since the values are backend-specific. If we agree that this problem is real, then we cannot force an ordering on the scope numbers.
> >
> > That’s an interesting consideration, and something I hadn’t thought of. I’m unsure offhand of how much it matters in practice. The alternative, I suppose, is having something like string-named scopes, but then we can’t do much with them at the IR level.
>
> This has me somewhat nonplussed as well.

That really depends on what we want to do at the IR level. Scopes do not affect transformations that move non-atomic accesses around atomic accesses. The scope on the atomic access should not matter to the non-atomic accesses. The interesting case is when the compiler tries to optimize atomic accesses with respect to each other, and their scopes do not match. But it might be sufficient to leave such transformations to the target, since quite possibly, other target-specific information might be necessary to make them work or even to say whether they are beneficial.

> > > So far, I have refrained from proposing a keyword for cross thread scope in the text format, because (a) there never was one and (b) it is not strictly needed since it is the default anyway.
> > > I am fine either way, but we will first have to decide what the new keyword should be. I find "allthreads" to be a decent counterpart for "singlethread" ... "crossthread" is not good enough since intermediate scopes have multiple threads too.
> >
> > This actually raises another question. In principle, the “most visible” scope ought to be something like “system” or “device”, meaning a completely uncached memory access that is visible to all peripherals in a heterogeneous system. However, this is almost certainly not what we want to have for typical memory accesses.
> >
> > To summarize, a prototypical scope nest, from most to least visible (aka least to most cacheable) might look like:
> >
> >     System -> AllThreads -> Various target-specific local scopes -> SingleThread
> >
> > If we wanted to go really gonzo, there could be a Network scope at the beginning for large-scale HPC systems, but I’m not sure how important that is to anyone.
>
> I probably *should* be in a position to be very interested in such a concept.... but honestly, I'm not. If I ever wanted to do something like this, I would just define the large-scale HPC system as the "system" and a single machine/node as some "local" scope.

I agree. The most accurate description of the highest scope is "address space scope", i.e., all threads that can access the address space being accessed. From that view, it does not matter if the threads are local, remote or situated on different devices, or such. It makes sense to not specify any keyword for this scope, and just say that "synchscope(0)" is default and need not be specified. Any other scope is an explicit optimization over a narrower set of threads.

> > As a related question, do we actually need the local scopes to be target specific? Are there systems, real or planned, that *aren’t* captured by:
> >
> >     [Network -> ] System -> AllThreads -> ThreadGroup -> SingleThread ?
>
> Sadly, I don't think this will work.
> In particular, there are real-world accelerators with multiple tiers of thread groups that are visible in the cache hierarchy subsystem.

The HSAIL 1.0 provisional spec has the following scopes: workitem, wavefront, workgroup, component, system. A component is anything that supports the HSAIL instruction set and can execute commands dispatched to it. I am not an authority on this, but to me, it is conceivable that there could be other scopes later, analogous to things such as one "die" or one "chip" or one "board" or one node in a cloud.

> I'm starting to think we might actually need to let the target define acceptable strings for memory scopes and a strict weak ordering over them.... That's really complex and heavyweight, but I'm not really confident that we're safe committing to something more limited. The good side is that we can add the SWO-stuff lazily as needed...
>
> Dunno, thoughts?

Just the thought of using strings in the IR smells like over-design to me. Going back to the original point, are target-independent optimizations on scoped atomic operations really so attractive?

But while the topic is wide open, here's another possibly whacky approach: we let the scopes be integers, and add a "scope layout" string similar to data-layout. The string encodes the ordering of the integers. If it is empty, then simple numerical comparisons are sufficient. Else the string spells out the exact ordering to be used. Any known current target will be happy with the first option. If some target inserts an intermediate scope in the future, then that version switches from empty to a fully specified string. The best part is that we don't even need to do this right now, and only come up with a "scope layout" spec when we really hit the problem for some future target.

Sameer.
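
The "scope layout" idea can be sketched roughly as follows. The message does not define a syntax for the string, so the format used here (comma-separated scope IDs listed from widest to narrowest) is purely an assumption for illustration, as are the function names `parseScopeLayout` and `isWider` and the convention that a smaller ID is wider when the layout is empty.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Parses a hypothetical layout string such as "0,1,4,2,3": scope IDs
// listed from widest to narrowest. An empty string yields an empty order.
std::vector<unsigned> parseScopeLayout(const std::string& layout) {
  std::vector<unsigned> order;
  std::istringstream in(layout);
  std::string field;
  while (std::getline(in, field, ',')) {
    order.push_back(static_cast<unsigned>(std::stoul(field)));
  }
  return order;
}

// True if scope `a` is strictly wider than scope `b`.
bool isWider(const std::vector<unsigned>& order, unsigned a, unsigned b) {
  if (order.empty()) {
    // Empty layout: fall back to numeric comparison
    // (assumed convention: smaller ID is wider).
    return a < b;
  }
  auto pa = std::find(order.begin(), order.end(), a);
  auto pb = std::find(order.begin(), order.end(), b);
  assert(pa != order.end() && pb != order.end() &&
         "scope ID missing from layout");
  // Earlier position in the layout means wider scope.
  return pa < pb;
}
```

With the layout `"0,1,4,2,3"`, a newly introduced scope gets the fresh ID 4 and is placed between 1 and 2 by the string alone, so IDs 0 through 3 in existing bitcode keep their meaning and no renumbering is needed.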