thr3ads.net - llvm dev - [llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs) [Feb 2019]

If this information is useful, please help other people find it:
Share via:

ramshankar r via llvm-dev

2019-Feb-02 18:35 UTC

[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)

Hi,

I am also working on compiling for SIMT GPUs. It am still trying to factor
all of the discussions here and have a few questions (having seen some
really unintelligible looking bugs and loads of confusion around
maintaining values in vector or scalar registers). I am a couple of months
late to this thread but I would like it very much if you would give me
feedback on understanding these issues.

It appears to me that there are two main issues in this discussion - one
concerning code-motion of sync functions across their controlling
conditions: moving ballot earlier or later in control flow: closer to the
CFG entry node or to the exit node when you look at the graph without
back-edges. For managing this issue, an 'anticonvergent' property is
suggested. That looks similar to adding a control-dependence edge to this
'sync' function. Imagine that every controlling edge in the graph (edges
with TRUE or FALSE labels) is split and an convenience instruction added in
that newly formed BB- call it a "control-dependence" intrinsic. After
that,
imagine adding the result of this intrinsic to all of the synchronization
functions like ballot as one of the arguments. This still allows code
motion of the function into an inside or enclosed control-dependence and
for that I think the 'convergent' property is suggested. Technically,
there
is only one 'control-dependence' value when executing with a single PC,
SIMT style, like the EXEC register and moving the function with a
control-dependence into a region with another control-dependence must
always be  disallowed. But with SSA form each of the intrinsics are going
to have to define a new control-variable and we would thus need
'convergent' property. So, 'anticonvergent' is to add an
incoming
control-dependence and 'convergent' to always preserve the
control-dependence. Control dependence is not new to LLVM and it is pretty
much always there with RIDFs used by ADCE etc. Both convergent and
anticonvergent looks 'anti' to the SSA idea of always defining something
new. But SSA or not, synchronization is applicable on an 'actual'
variable
like a 'Stack variable' and not an SSA def and we can manage to get
around
with the 'anticonvergent' and 'convergent' properties. Also
'convergent'
property is based on the idea of enforcing presence of the function only at
'join' points like where we put PHI nodes. Would that be fair to say?

The second issue seems to be about lack of control-flow merge points, for
which, it has been noted, that there is no dedicated 'continue' blocks
to
go-to, before taking back-edges. That is, there is a danger of not merging
all straight-line control inside loops, before taking back-edges. I have
heard about the loop-simplify pass in LLVM, which changes loops to ensure
there is only one back-edge in them. That would change loops to have that
dedicated latch block which could be used as a merge block.  What is
missing in my understanding here?

Another approach that I read somewhere drops all break statements
cold-turkey (break change EXEC masks but no SI_IF/ELSE for them) and let
control-flow reach the latch eventually. This might be like a thread-kill
instruction that sets EXEC to 0 for a lane, without a SBRANCH_EXEC branch.
This should mean no PHI nodes for breaks except for the one in the loop
header. There may still be PHI-nodes retained for joins of provably uniform
branches. Dynamic non-break branches, if preserved for the branch-any
heuristic (SI_IF/SI_ELSE pseudo branch which becomes SBRANCH_EXEC etc),
would have to be linearizing PHI operands and each break should be blended
with an incoming PHI. This should preserve intended behavior after PHI
lowering.

Elementary predication is shown to work with vectorization even for with
irreducible control flow, as described by Karrenberg et all, Moll et al in
their WFV works. In fact they turn varying break conditions into scalars,
still in SSA form, and examine the scalar using the llvm any-intrinsic. If
scalar-branches needed to be preserved, they came up with an approach when
CFG does not have irreducible control-flow and has only natural loops(Moll
et al, PLDI2018). Given the lack of clarity around irreducible control flow
and preserving uniform branches, I would still be pretty happy if there is
something that works for unstructured control flow with natural loops in
them. But there are other pluses like they allow loops with multiple break
targets and track break masks for each target. In addition they enabled
branch_any, which will be taking advantage of branches which become uniform
only at run-time, much like SI SBRANCH_EXEC. Conceptually they would be
needing to the same things. But SSA/LLVM etc is still confusing at least to
me when it comes to GPU SIMT backends and preserving break conditions in
scalar registers etc. As I mentioned earlier, looking for your feedback,

Regards,
Ramshankar
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190202/7b71d4f9/attachment.html>

Nicolai Hähnle via llvm-dev

2019-Feb-04 07:59 UTC

head link

[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)

Hi Ramshankar,

On 02.02.19 19:35, ramshankar r via llvm-dev wrote:> I am also working on compiling for SIMT GPUs. It am still trying to 
> factor all of the discussions here and have a few questions (having seen 
> some really unintelligible looking bugs and loads of confusion around 
> maintaining values in vector or scalar registers). I am a couple of 
> months late to this thread but I would like it very much if you would 
> give me feedback on understanding these issues.
It's still pretty much an ongoing discussion, so thanks for joining in :)

> It appears to me that there are two main issues in this discussion - one 
> concerning code-motion of sync functions across their controlling 
> conditions: moving ballot earlier or later in control flow: closer to 
> the CFG entry node or to the exit node when you look at the graph 
> without back-edges. For managing this issue, an 'anticonvergent' 
> property is suggested.
I am actually against 'anticonvergent', because 'convergent'
already has
problems that I like to call 'spooky action at a distance'.

Specifically, the jump threading pass as it currently exists in LLVM is 
incorrect in the presence of convergent calls.

> That looks similar to adding a control-dependence 
> edge to this 'sync' function. Imagine that every controlling edge
in the
> graph (edges with TRUE or FALSE labels) is split and an convenience 
> instruction added in that newly formed BB- call it a 
> "control-dependence" intrinsic. After that, imagine adding the
result of
> this intrinsic to all of the synchronization functions like ballot as 
> one of the arguments. This still allows code motion of the function into 
> an inside or enclosed control-dependence and for that I think the 
> 'convergent' property is suggested. Technically, there is only one 
> 'control-dependence' value when executing with a single PC, SIMT
style,
> like the EXEC register and moving the function with a control-dependence 
> into a region with another control-dependence must always be 
>   disallowed. But with SSA form each of the intrinsics are going to have 
> to define a new control-variable and we would thus need
'convergent'
> property. So, 'anticonvergent' is to add an incoming
control-dependence
> and 'convergent' to always preserve the control-dependence. Control
> dependence is not new to LLVM and it is pretty much always there with 
> RIDFs used by ADCE etc. Both convergent and anticonvergent looks
'anti'
> to the SSA idea of always defining something new. But SSA or not, 
> synchronization is applicable on an 'actual' variable like a
'Stack
> variable' and not an SSA def and we can manage to get around with the 
> 'anticonvergent' and 'convergent' properties. Also
'convergent' property
> is based on the idea of enforcing presence of the function only at 
> 'join' points like where we put PHI nodes. Would that be fair to
say?
It would be nice to see some explicit examples because I'm not sure I'm 
following exactly what you're proposing.

There have been a number of flavors of proposals where sync functions 
get an additional argument that directly or indirectly represents the 
set of threads participating in the sync. The main question such 
proposals need to answer is: How do you ensure that the intrinsic call 
which *produces* that value isn't moved in an incorrect way?

> The second issue seems to be about lack of control-flow merge points, 
> for which, it has been noted, that there is no dedicated 'continue'
> blocks to go-to, before taking back-edges. That is, there is a danger of 
> not merging all straight-line control inside loops, before taking 
> back-edges. I have heard about the loop-simplify pass in LLVM, which 
> changes loops to ensure there is only one back-edge in them. That would 
> change loops to have that dedicated latch block which could be used as a 
> merge block.  What is missing in my understanding here?
Unfortunately, you can't just wish the problem away like this. You have 
to define the semantics for *all* legal IR, and that includes loops with 
multiple back edges. (It also includes irreducible control flow!)

> Another approach that I read somewhere drops all break statements 
> cold-turkey (break change EXEC masks but no SI_IF/ELSE for them) and let 
> control-flow reach the latch eventually. This might be like a 
> thread-kill instruction that sets EXEC to 0 for a lane, without a 
> SBRANCH_EXEC branch. This should mean no PHI nodes for breaks except for 
> the one in the loop header. There may still be PHI-nodes retained for 
> joins of provably uniform branches. Dynamic non-break branches, if 
> preserved for the branch-any heuristic (SI_IF/SI_ELSE pseudo branch 
> which becomes SBRANCH_EXEC etc), would have to be linearizing PHI 
> operands and each break should be blended with an incoming PHI. This 
> should preserve intended behavior after PHI lowering.
This seems to be more about backend codegen, which tends to be easier 
because you have more room for target-specific approaches.

We're looking to define semantics of LLVM IR.

Cheers,
Nicolai

> Elementary predication is shown to work with vectorization even for with 
> irreducible control flow, as described by Karrenberg et all, Moll et al 
> in their WFV works. In fact they turn varying break conditions into 
> scalars, still in SSA form, and examine the scalar using the llvm 
> any-intrinsic. If scalar-branches needed to be preserved, they came up 
> with an approach when CFG does not have irreducible control-flow and has 
> only natural loops(Moll et al, PLDI2018). Given the lack of clarity 
> around irreducible control flow and preserving uniform branches, I would 
> still be pretty happy if there is something that works for unstructured 
> control flow with natural loops in them. But there are other pluses like 
> they allow loops with multiple break targets and track break masks for 
> each target. In addition they enabled branch_any, which will be taking 
> advantage of branches which become uniform only at run-time, much like 
> SI SBRANCH_EXEC. Conceptually they would be needing to the same things. 
> But SSA/LLVM etc is still confusing at least to me when it comes to GPU 
> SIMT backends and preserving break conditions in scalar registers etc. 
> As I mentioned earlier, looking for your feedback,
> 
> Regards,
> Ramshankar
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 

-- 
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.

llvm dev - Feb 2019 - [RFC] Adding thread group semantics to LangRef (motivated by GPUs)

[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)

[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)