Doerfert, Johannes Rudolf via llvm-dev
2019-Jan-22 18:17 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
Where we are
------------

Currently, when we generate OpenMP target offloading code for GPUs, we
use sufficient syntactic criteria to decide between two execution modes:

1) SPMD      -- All target threads (in an OpenMP team) run all the code.

2) "Guarded" -- The master thread (of an OpenMP team) runs the user
                code. If an OpenMP distribute region is encountered,
                i.e., if all threads (in the OpenMP team) are supposed
                to execute the region, the master wakes up the idling
                worker threads and points them to the correct piece of
                code for distributed execution.

For a variety of reasons we (generally) prefer the first execution mode.
However, depending on the code, that might not be valid, or we might
just not know whether it is in the Clang code generation phase.

The implementation of the "guarded" execution mode follows roughly the
state machine description in [1], though the implementation is different
(more general) nowadays.


What we want
------------

Increase the amount of code executed in SPMD mode and the use of
lightweight "guarding" schemes where appropriate.


How we (could) get there
------------------------

We propose the following two modifications, in order:

1) Move the state machine logic into the OpenMP runtime library. That
   means in SPMD mode all device threads will start the execution of
   the user code, thus emerge from the runtime, while in guarded mode
   only the master will escape the runtime and the other threads will
   idle in their state machine code that is now just "hidden".

   Why:
   - The state machine code cannot be (reasonably) optimized anyway, so
     moving it into the library shouldn't hurt runtime but might even
     improve compile time a little bit.
   - The change should also simplify the Clang code generation, as we
     would generate structurally the same code for both execution modes;
     only the runtime library calls, or their arguments, would differ
     between them.
   - The reason we should not "just start in SPMD mode" and "repair" it
     later is simple: this way we always have semantically correct and
     executable code.
   - Finally, and most importantly, there is then only little difference
     (see above) between the two modes in the code generated by Clang.
     If we later analyze the code, trying to decide whether we can use
     SPMD mode instead of guarded mode, the analysis and transformation
     become much simpler.

2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
   e.g., through the runtime library calls used, and that tries to
   convert it into SPMD mode, potentially by introducing lightweight
   guards in the process.

   Why:
   - After the inliner and the canonicalizations, we have a clearer
     picture of the code that is actually executed in the target region
     and all the side effects it contains. Thus, we can make an educated
     decision on the required amount of guards that prevent unwanted
     side effects from happening after a move to SPMD mode.
   - At this point we can more easily introduce different schemes to
     avoid side effects by threads that were not supposed to run. We can
     decide whether a state machine is needed, conditionals should be
     employed, masked instructions are appropriate, or "dummy" local
     storage can be used to hide the side effect from the outside world.


None of this is implemented yet, but we plan to start in the immediate
future. Any comments, ideas, and criticism are welcome!
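As a concrete illustration of the two modes, consider a directive
structure like the following (a sketch only; the functions and bounds
are placeholders):

  void code1(int i)        { (void)i; }
  void code2(int i, int j) { (void)i; (void)j; }
  void code3(int i)        { (void)i; }

  void example(int N, int M) {
    /* Forces the guarded (non-SPMD) mode today: code1 and code3 may only
     * be executed by the team master, so the workers idle in the state
     * machine until the master reaches the nested parallel region. */
    #pragma omp target teams distribute
    for (int i = 0; i < N; ++i) {
      code1(i);                       /* master-only             */
      #pragma omp parallel for
      for (int j = 0; j < M; ++j)
        code2(i, j);                  /* all threads of the team */
      code3(i);                       /* master-only             */
    }

    /* No sequential team code: this can be emitted in SPMD mode, where
     * every thread simply runs all of the code. */
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < N; ++i)
      code2(i, 0);
  }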
Cheers,
  Johannes


P.S. [2-4] provide further information on implementation and features.

[1] https://ieeexplore.ieee.org/document/7069297
[2] https://dl.acm.org/citation.cfm?id=2833161
[3] https://dl.acm.org/citation.cfm?id=3018870
[4] https://dl.acm.org/citation.cfm?id=3148189

--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

jdoerfert at anl.gov
Alexey Bataev via llvm-dev
2019-Jan-22 18:34 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
-------------
Best regards,
Alexey Bataev

22.01.2019 13:17, Doerfert, Johannes Rudolf wrote:
> [...]
>    - Finally, and most importantly, there is then only little difference
>      (see above) between the two modes in the code generated by Clang.
>      If we later analyze the code, trying to decide whether we can use
>      SPMD mode instead of guarded mode, the analysis and transformation
>      become much simpler.

The last item is wrong, unfortunately. A lot of things in the codegen
depend on the execution mode, e.g., correct support of data sharing. Of
course, we can try to generalize the codegen and rely completely on the
runtime, but the performance is going to be very poor. We still need
static analysis in the compiler. I agree that it is better to move this
analysis to the backend, at least after the inlining, but at the moment
it is not possible. We need support for late outlining, which will allow
us to implement better detection of the SPMD constructs and improve
performance.
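To make the data-sharing concern concrete: in a pattern like the
following (a minimal sketch; names and bounds are placeholders), the
value of "partial" is written by the team master between the parallel
regions, yet the workers of the nested "parallel for" must read and
update it. In generic mode the compiler and runtime therefore have to
place it in team-visible storage rather than on the master's private
stack; in SPMD mode every thread executes the surrounding code itself
and no such machinery is needed.

  void data_sharing_example(int N, int M, double *out) {
    #pragma omp target teams distribute map(from : out[0:N])
    for (int i = 0; i < N; ++i) {
      double partial = 0.0;            /* written by the team master only   */
      #pragma omp parallel for reduction(+ : partial)
      for (int j = 0; j < M; ++j)
        partial += (double)i * j;      /* workers must see and update it    */
      out[i] = partial;                /* master consumes the shared result */
    }
  }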
Doerfert, Johannes Rudolf via llvm-dev
2019-Jan-22 18:43 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
Could you elaborate on what you refer to wrt data sharing? What do we
currently do in the Clang code generation that we could not effectively
implement in the runtime, potentially with support of an LLVM pass?

Thanks,
  James

________________________________
From: Alexey Bataev <a.bataev at outlook.com>
Sent: Tuesday, January 22, 2019 12:34:01 PM
To: Doerfert, Johannes Rudolf; cfe-dev at lists.llvm.org
Cc: openmp-dev at lists.llvm.org; LLVM-Dev; Finkel, Hal J.; Alexey Bataev; Arpith Chacko Jacob
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

[...]
Doerfert, Johannes via llvm-dev
2019-Jan-31 00:56 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
Hi Doru,

[+ llvm-dev and cfe-dev]

On 01/30, Gheorghe-Teod Bercea wrote:
> Hi Johannes,
>
> First of all thanks for looking into the matter of improving non-SPMD mode!
>
> I have a question regarding the state machine that you said you'd like
> to replace/improve. There are cases (such as target regions that span
> multiple compilation units) where the switch statement is required. Is
> this something that your changes will touch in any way?

There will not be a difference. Let me explain in some detail, as there
seems to be a lot of confusion on this state machine topic:

Now:

Build a state machine in the user code (module) with all the parallel
regions as explicit targets of the switch statement and a fallback
default that does an indirect call to the requested parallel region.

Proposed, after Clang:

Use the runtime state machine implementation [0], which reduces the
switch to the default case, thus an indirect call to the requested
parallel region. This will always work, regardless of the translation
unit that contained the parallel region (pointer).

Proposed, after the OpenMP-Opt pass in LLVM (assuming SPMD wasn't achieved):

All reachable parallel regions in a kernel are collected and used to
create the switch statement in the user code (module) [1, line 111],
with a fallback if there are potentially [1, line 212] hidden parallel
regions.

Does that make sense?

[0] https://reviews.llvm.org/D57460#change-e9Ljd9RgdWYz
[1] https://reviews.llvm.org/D57460#change-8gnnGNfJVR4B

> My next question is, for the workloads which are in the same
> compilation unit there is a trick that code gen performs (or could
> perform, I'm not sure if this has been upstreamed) where it can check
> for the specific name of an outlined function and then just call it
> directly, thus making that function inline-able (thus erasing most if
> not all the overhead of having the state machine in the first place).
> In other words, the "worst" part of the switch statement will only
> apply to outlined functions from other compilation units. With this in
> mind, what would the impact of your changes be in the end? If this part
> isn't clear I can do some digging to find out how this actually works
> in more detail; it's been too long since I've had to look at this part.

See the answer above.

> Can you share some performance numbers given an example you have been
> looking at? I see you have one that uses "#pragma omp atomic". I would
> avoid using something like that since it may have other overheads not
> related to your changes. I would put together an example with this
> directive structure:
>
>   #pragma omp target teams distribute
>   for(...){
>     <code1>
>     #pragma omp parallel for
>     for(...) {
>       <code2>
>     }
>     <code3>
>   }
>
> which forces the use of the master-worker scheme (non-SPMD mode)
> without any other distractions.

The atomic stuff I used to determine correctness. I haven't yet looked
at performance. I will do so now and inform you of my results.

> It would then be interesting to understand how you plan to change the
> LLVM code generated for this,

The examples show how the LLVM-IR is supposed to look, right?

> what the overheads that you're targeting are (register usage,
> synchronization cost etc), and then what the performance gain is
> compared to the current scheme.

I can also compare register usage in addition to performance, but there
is no difference in synchronization. The number and (relative) order of
the original runtime library calls stays the same. The number of user
code -> runtime library calls is even decreased.
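A rough sketch of the worker side of the two state-machine variants
described above, in plain host C for illustration only (all names are
placeholders; the real implementation synchronizes through GPU barriers
and a shared work-function slot, which the simulated "work queue" below
merely stands in for):

  #include <stdbool.h>
  #include <stdio.h>

  typedef void (*work_fn_t)(void);

  /* Outlined parallel regions that are statically visible in this module. */
  static void parallel_region_1(void) { puts("region 1"); }
  static void parallel_region_2(void) { puts("region 2"); }

  /* Stand-in for the runtime: hands out a short, fixed work list. On a
   * GPU this would be a barrier plus a read of the shared work-function
   * slot written by the team master. */
  static bool wait_for_work(work_fn_t *fn) {
    static work_fn_t queue[] = { parallel_region_1, parallel_region_2, NULL };
    static int next = 0;
    *fn = queue[next] ? queue[next++] : NULL;
    return *fn != NULL;               /* false == terminate the workers */
  }

  /* Generic state machine (kept entirely in the runtime library): every
   * parallel region is reached through an indirect call, which also
   * works when the region lives in another translation unit. */
  static void worker_loop_generic(void) {
    work_fn_t fn;
    while (wait_for_work(&fn))
      fn();
  }

  /* Customized state machine (what the OpenMP-Opt pass would re-create
   * in the user module): an if-cascade over the visible parallel regions
   * with the indirect call only as a fallback for hidden or external
   * regions. */
  static void worker_loop_customized(void) {
    work_fn_t fn;
    while (wait_for_work(&fn)) {
      if (fn == parallel_region_1)
        parallel_region_1();          /* direct call, inlinable */
      else if (fn == parallel_region_2)
        parallel_region_2();
      else
        fn();                         /* fallback indirect call */
    }
  }

  int main(void) {
    worker_loop_generic();            /* consumes the simulated work list */
    worker_loop_customized();         /* second run finds the list empty  */
    return 0;
  }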
Please let me know if this helps and what questions remain.

Thanks,
  Johannes


> From: "Doerfert, Johannes" <jdoerfert at anl.gov>
> To: Alexey Bataev <a.bataev at outlook.com>
> Cc: Guray Ozen <gozen at nvidia.com>, Gheorghe-Teod Bercea
>     <gheorghe-teod.bercea at ibm.com>, "openmp-dev at lists.llvm.org"
>     <openmp-dev at lists.llvm.org>, "Finkel, Hal J." <hfinkel at anl.gov>,
>     "Gregory.Rodgers at amd.com" <Gregory.Rodgers at amd.com>,
>     "kli at ca.ibm.com" <kli at ca.ibm.com>
> Date: 01/30/2019 04:14 PM
> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
> ________________________________
>
> I don't really see "many ifs and maybes", actually none.
>
> Anyway, I will now work on a patch set that adds the new functionality
> under a cmd flag in order to showcase correctness and performance on
> real code.
>
> If you, or somebody else, have interesting examples, please feel free
> to point me at them.
>
> Thanks,
>   Johannes
>
> ________________________________
> From: Alexey Bataev <a.bataev at outlook.com>
> Sent: Wednesday, January 30, 2019 2:18:19 PM
>
> Currently, there are too many "if"s and "maybe"s. If you can provide a
> solution that does not break anything, does not affect the performance,
> and does not require changes in the backend - then go ahead with the
> patches.
>
> -------------
> Best regards,
> Alexey Bataev
>
> 30.01.2019 14:49, Doerfert, Johannes:
> No, SPMD mode will not be affected at all.
>
> The "worse" part is the following:
> If we inline runtime library calls before the openmp-opt pass had a
> chance to look at the code, we will not have a customized state machine
> for the __non-SPMD__ case. That is, the if-cascade checking the work
> function pointer is not there.
>
> Avoiding this potential performance decline is actually very easy.
> While we do not have the "inline_late" capability, run the openmp-opt
> pass __before__ the inliner and we will not get "worse" code. We might
> however miss out on _new_ non-SPMD -> SPMD transformations.
>
> Does that make sense?
>
> ________________________________
> From: Alexey Bataev <a.bataev at outlook.com>
> Sent: Wednesday, January 30, 2019 1:44:10 PM
>
> Any "worse" is not a good idea. We need to avoid it. It would be good
> if the new code did not affect the performance, especially for SPMD
> mode (I think this "worse" will affect exactly SPMD mode, no?)
>
> -------------
> Best regards,
> Alexey Bataev
>
> 30.01.2019 14:38, Doerfert, Johannes:
> The LLVM optimization (openmp-opt), which does non-SPMD -> SPMD and
> custom state machine generation, will not fire if the
> __kernel_general_... calls are "missing". Thus if we inline "too
> early", we are "stuck" with the non-SPMD choice (not worse than what we
> have now!) and the default library state machine ("worse" than what we
> have right now). Does that make sense?
>
> The second option described what I want to see us do "later" in order
> to avoid the above scenario and always get both, openmp-opt and
> inlining of the runtime and work functions.
>
> ________________________________
> From: Alexey Bataev <a.bataev at outlook.com>
> Sent: Wednesday, January 30, 2019 1:25:42 PM
>
> Sorry, I did not understand your answer correctly. But you wrote:
> "for now, not doing the optimization is just fine."
> What do you mean?
>
> -------------
> Best regards,
> Alexey Bataev
>
> 30.01.2019 14:23, Doerfert, Johannes:
> Alexey,
>
> I'm not sure how to interpret "Bad idea!", but I think there is again a
> misunderstanding. To help me understand, could you try to elaborate a
> bit?
>
> To make my last email clear:
> I __do__ want inlining. Both answers to your earlier inlining questions
> do actually assume the runtime library calls __are eventually
> inlined__; that is why I mentioned LTO and the runtime as bitcode.
>
> Cheers,
>   Johannes
>
> ________________________________
> From: Alexey Bataev <a.bataev at outlook.com>
> Sent: Wednesday, January 30, 2019 1:14:56 PM
>
> Bad idea!
>
> -------------
> Best regards,
> Alexey Bataev
>
> 30.01.2019 14:11, Doerfert, Johannes:
> Sure I do. Why do you think I don't?
>
> ________________________________
> From: Alexey Bataev <a.bataev at outlook.com>
> Sent: Wednesday, January 30, 2019 1:00:59 PM
>
> You don't want to do the inlining?
>
> -------------
> Best regards,
> Alexey Bataev
>
> 30.01.2019 13:59, Doerfert, Johannes:
> - For now, not doing the optimization is just fine. The whole idea is
>   that the code is always valid.

--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

jdoerfert at anl.gov
Gheorghe-Teod Bercea via llvm-dev
2019-Jan-31 15:05 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
Hi Johannes,

Thank you for the explanation.

I think we need to clarify some details about code generation in Clang today:

1. non-SPMD mode, or generic mode, uses the master-worker code-gen
   scheme where the master thread and the worker threads are disjoint
   sets of threads (when one set runs, the other set is blocked and
   doesn't participate in the execution):

   workers | master
   ====================
   BLOCKED | RUNNING
   ------- sync -------
   RUNNING | BLOCKED
   ------- sync -------
   BLOCKED | RUNNING

2. the worker threads, in their RUNNING state above, contain a state
   machine which chooses the parallel region to be executed. Today this
   choice happens in one of two ways: explicit targets (where you know
   what outlined region you are calling and you just call it) and
   indirect targets (via a function pointer set by the master thread in
   one of its RUNNING regions):

   workers | master
   ====================
   BLOCKED | RUNNING
   ------- sync -------
   RUNNING |
    state  | BLOCKED
   machine |
   ------- sync -------
   BLOCKED | RUNNING

Your intended changes (which only target the RUNNING state machine of
the workers):
- remove explicit targets from the current code gen (by itself this is a
  major step back!!);
- introduce a pass in LLVM which will add back the explicit targets.

Can you point out any major improvements this will bring compared to the
current state? From your answer below you mention a lower number of
function calls. Since today we inline everything anyway, how does that
help? If you haven't considered performance so far, how come you're
proposing all these changes? What led you to propose all these changes?

In SPMD mode all threads execute the same code. Using the notation in
the schemes above you can depict this as:

      all threads
   ====================
        RUNNING

No state machine being used, no disjoint sets of threads. This is as if
you're executing CUDA code.

Could you explain what your proposed changes are in this context?
Could you also explain what you mean by "assuming SPMD wasn't achieved"?
Do you expect to write another LLVM pass which will transform the
master-worker scheme + state machine into an SPMD scheme?

Thanks,

--Doru


From:    "Doerfert, Johannes" <jdoerfert@anl.gov>
To:      Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea@ibm.com>
Cc:      Alexey Bataev <a.bataev@outlook.com>, Guray Ozen <gozen@nvidia.com>,
         "Gregory.Rodgers@amd.com" <Gregory.Rodgers@amd.com>,
         "Finkel, Hal J." <hfinkel@anl.gov>, "kli@ca.ibm.com" <kli@ca.ibm.com>,
         "openmp-dev@lists.llvm.org" <openmp-dev@lists.llvm.org>,
         LLVM-Dev <llvm-dev@lists.llvm.org>,
         "cfe-dev@lists.llvm.org" <cfe-dev@lists.llvm.org>
Date:    01/30/2019 07:56 PM
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
________________________________

> [...]
Doerfert, Johannes via llvm-dev
2019-Mar-13 19:08 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
Please consider reviewing the code for the proposed approach here:
  https://reviews.llvm.org/D57460

Initial tests, e.g., on the nw (Needleman-Wunsch) benchmark in the
Rodinia 3.1 benchmark suite, showed a 30% improvement after SPMD mode
was enabled automatically. The code in nw is conceptually equivalent to
the first example in the "to_SPMD_mode.ll" test case that can be found
here:
  https://reviews.llvm.org/D57460#change-sBfg7kuN4Bid

The implementation is missing key features, but one should be able to
see the overall design by now. Once accepted, the missing features and
more optimizations will be added.
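To picture what enabling SPMD mode does to a previously guarded region,
the conceptual effect is roughly the following (a hand-written sketch,
not output of the pass; all names are placeholders). Where the formerly
master-only statements have no problematic side effects the body can run
in SPMD mode unchanged; where they do, lightweight guards of this flavor
are what the second step of the RFC would introduce:

  /* Placeholders for the statements of a guarded target region and for
   * the team-wide synchronization primitive. */
  static void code1(int i)        { (void)i; }           /* sequential part before */
  static void code2(int i, int t) { (void)i; (void)t; }  /* the parallel work      */
  static void code3(int i)        { (void)i; }           /* sequential part after  */
  static void team_barrier(void)  { }

  /* Conceptual shape of one loop iteration after SPMD-zation: every
   * thread of the team executes it, and the formerly master-only
   * statements are guarded so their side effects happen once per team. */
  void kernel_body_spmd(int i, int thread_id) {
    if (thread_id == 0)
      code1(i);                 /* lightweight guard: master only */
    team_barrier();
    code2(i, thread_id);        /* all threads */
    team_barrier();
    if (thread_id == 0)
      code3(i);                 /* lightweight guard: master only */
  }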
--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

jdoerfert at anl.gov
Alexey Bataev via llvm-dev
2019-Mar-13 19:15 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
There are tooooooo(!) many changes; I don't know who's going to review
sooooo big a patch. You definitely need to split it into several smaller
patches. Also, I don't like the idea of adding one more class for the
NVPTX codegen. All your changes should be on top of the existing
solution.

-------------
Best regards,
Alexey Bataev

13.03.2019 15:08, Doerfert, Johannes wrote:
> Please consider reviewing the code for the proposed approach here:
> https://reviews.llvm.org/D57460
> [...]
Doerfert, Johannes via llvm-dev
2019-Mar-25 17:20 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
As the email thread is by now long and hard to follow, I wanted to start
a new branch explaining the patches for review, how I think we should
proceed from here, and some feature/problem discussion.


The patches:
------------

Currently, the initial implementation* is split across the following
seven patches, shown in "dependence order":

  [OpenMP][Helper]      https://reviews.llvm.org/D59424
  [OpenMP][Runtime]     https://reviews.llvm.org/D59319
  [Clang][Helper]       https://reviews.llvm.org/D59418
  [Clang][Helper]       https://reviews.llvm.org/D59420
  [Clang][Helper]       https://reviews.llvm.org/D59421
  [Clang][Codegen]      https://reviews.llvm.org/D59328
  [LLVM][Optimization]  https://reviews.llvm.org/D59331

* The original, now abandoned, aggregate patch can be found here:
  https://reviews.llvm.org/D57460


Next steps:
-----------

I kindly ask interested parties to post questions, comments, and
reviews. It would also be good if people could look into alternative
target device library implementations, e.g., for AMD GPUs or other
non-NVIDIA targets. This would help to test the "hardware agnostic"
hypothesis.


(Missing) features & known problems:
------------------------------------

The initial implementation discussed above includes the following
functionality:
- Generate valid LLVM-IR for "omp target" with enclosed, potentially
  nested, "omp parallel" pragmas.
- Translate non-SPMD mode regions to SPMD mode regions if that is valid
  without code changes.
- Create a customized state machine for non-SPMD mode regions.
  Customization for now means that all visible enclosed parallel regions
  are checked as part of an if-cascade and called directly before a
  potential fallback indirect call is reached.

Missing features and known problems:
- Reductions are not supported yet. My plan is to use the ideas
  presented by Garcia De Gonzalo et al. [1] at CGO'19 in the runtime and
  let Clang emit some kind of
    "__kmpc_XXXX_reduction_begin(kind, loc)"
    "__kmpc_XXXX_reduction_end(kind, loc)"
  calls at the beginning and end of the kernel (see the sketch further
  below). The runtime or an LLVM optimization should then decide on the
  reduction strategy.
- Critical regions are not supported yet. The NVPTX codegen approach is
  probably fine; we just need to port it.
- When changing non-SPMD mode kernels to SPMD mode kernels we might need
  to change the schedule decisions for loops. As a consequence, we might
  want to add a level of abstraction for these as well to make that
  simple.

Thanks,
  Johannes


[1] Automatic Generation of Warp-Level Primitives and Atomic
    Instructions for Fast and Portable Parallel Reduction on GPUs.
    Simon Garcia De Gonzalo and Sitao Huang (University of Illinois at
    Urbana–Champaign); Juan Gomez-Luna (Swiss Federal Institute of
    Technology (ETH) Zurich); Simon Hammond (Sandia National
    Laboratories); Onur Mutlu (Swiss Federal Institute of Technology
    (ETH) Zurich); Wen-mei Hwu (University of Illinois at
    Urbana–Champaign)
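A rough sketch of the reduction bracketing idea (illustrative only: the
"__kmpc_xxxx_reduction_begin/end" entry points are placeholders, just
like the XXXX names above, and the stubs merely stand in for a real
device runtime that would pick warp shuffles, shared memory, atomics, or
another strategy):

  #include <stdio.h>

  /* Placeholder runtime entry points; a real implementation, or a later
   * LLVM optimization, would choose the actual reduction strategy. */
  static void __kmpc_xxxx_reduction_begin(int kind, const char *loc) {
    (void)kind; (void)loc;            /* set up reduction bookkeeping */
  }
  static void __kmpc_xxxx_reduction_end(int kind, const char *loc) {
    (void)kind; (void)loc;            /* combine the partial results  */
  }

  /* Conceptual shape of a kernel with a '+' reduction: the compiler
   * brackets the kernel body with the two calls and otherwise emits the
   * per-thread reduction update as plain code. */
  double reduce_sum(const double *a, int n) {
    double sum = 0.0;
    __kmpc_xxxx_reduction_begin(/*kind=*/0, "reduce_sum");
    for (int i = 0; i < n; ++i)       /* per-thread partial sums on a GPU */
      sum += a[i];
    __kmpc_xxxx_reduction_end(/*kind=*/0, "reduce_sum");
    return sum;
  }

  int main(void) {
    double v[4] = {1, 2, 3, 4};
    printf("%f\n", reduce_sum(v, 4));
    return 0;
  }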
--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

jdoerfert at anl.gov