thr3ads.net - llvm dev - [llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation" [Jan 2019]

Doerfert, Johannes Rudolf via llvm-dev

2019-Jan-22 18:54 UTC

[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"

We could still do that in clang, couldn't we?


________________________________
From: Alexey Bataev <a.bataev at outlook.com>
Sent: Tuesday, January 22, 2019 12:52:42 PM
To: Doerfert, Johannes Rudolf; cfe-dev at lists.llvm.org
Cc: openmp-dev at lists.llvm.org; LLVM-Dev; Finkel, Hal J.; Alexey Bataev;
Arpith Chacko Jacob
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"


The globalization of local variables, for example. It must be implemented in
the compiler, not in the runtime, to get good performance.
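
For readers unfamiliar with the term, "globalization" refers to placing a
variable that would normally live in one GPU thread's private stack into memory
that the other threads of the team can reach. The sketch below is only an
illustration and is not taken from the thread; the function name
`sum_of_squares` and the fixed buffer size are assumptions. In the "guarded"
mode described in the quoted RFC, only the master thread executes the
declaration of `partial`, yet every thread of the parallel region writes to it,
so the array cannot stay in the master's private storage.

```cpp
#include <cassert>

// Hypothetical example: a local declared in the sequential part of a target
// region and then written by every thread of a nested parallel region.
int sum_of_squares(int n) {
  assert(n >= 0 && n <= 64);  // the illustrative buffer below holds 64 entries
  int result = 0;
#pragma omp target map(tofrom : result)
  {
    // In "guarded" mode this declaration is executed by the team master only.
    int partial[64] = {0};

    // All threads of the team write into `partial`, so it has to be placed
    // in memory the workers can reach, i.e. it has to be "globalized".
#pragma omp parallel for
    for (int i = 0; i < n; ++i)
      partial[i] = i * i;

    // Back on the master: combine what the workers produced.
    for (int i = 0; i < n; ++i)
      result += partial[i];
  }
  return result;
}
```

Built without OpenMP support the pragmas are ignored and the function runs
sequentially on the host, which is enough to check the arithmetic; the
data-sharing question only arises when the region is offloaded.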


-------------
Best regards,
Alexey Bataev

On 22.01.2019 13:43, Doerfert, Johannes Rudolf wrote:
Could you elaborate on what you are referring to with respect to data sharing?
What do we currently do in the Clang code generation that we could not
effectively implement in the runtime, potentially with the support of an LLVM
pass?

Thanks,
  Johannes


________________________________
From: Alexey Bataev <a.bataev at outlook.com>
Sent: Tuesday, January 22, 2019 12:34:01 PM
To: Doerfert, Johannes Rudolf; cfe-dev at lists.llvm.org
Cc: openmp-dev at lists.llvm.org; LLVM-Dev; Finkel, Hal J.; Alexey Bataev;
Arpith Chacko Jacob
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"



-------------
Best regards,
Alexey Bataev

On 22.01.2019 13:17, Doerfert, Johannes Rudolf wrote:

Where we are
------------

Currently, when we generate OpenMP target offloading code for GPUs, we
use sufficient syntactic criteria to decide between two execution modes:
  1)      SPMD -- All target threads (in an OpenMP team) run all the code.
  2) "Guarded" -- The master thread (of an OpenMP team) runs the user
                  code. If an OpenMP distribute region is encountered, thus
                  if all threads (in the OpenMP team) are supposed to
                  execute the region, the master wakes up the idling
                  worker threads and points them to the correct piece of
                  code for distributed execution.

For a variety of reasons we (generally) prefer the first execution mode.
However, depending on the code, SPMD mode might not be valid, or we might
simply not know whether it is valid during the Clang code generation phase.
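
To make the two modes concrete, here is a small sketch of user code that would
plausibly fall on either side of the syntactic check. It is an illustration
under assumptions, not an excerpt from Clang's test suite: the function names
are made up, and which regions Clang actually classifies as SPMD depends on the
compiler version.

```cpp
#include <cassert>
#include <vector>

// SPMD candidate: one combined construct, so every thread of every team can
// start directly in the loop body; there is no sequential code in the region.
void scale_spmd(std::vector<int> &v, int factor) {
  int *data = v.data();
  int n = static_cast<int>(v.size());
#pragma omp target teams distribute parallel for map(tofrom : data[0:n])
  for (int i = 0; i < n; ++i)
    data[i] *= factor;
}

// "Guarded" candidate: sequential code with a side effect precedes the
// parallel region, so only the master may run it while the workers wait.
void scale_guarded(std::vector<int> &v, int factor) {
  assert(!v.empty());  // the sequential step below touches element 0
  int *data = v.data();
  int n = static_cast<int>(v.size());
#pragma omp target map(tofrom : data[0:n])
  {
    data[0] += 1;  // must happen exactly once per team
#pragma omp parallel for
    for (int i = 0; i < n; ++i)
      data[i] *= factor;
  }
}
```

Both functions also compile and run on the host when OpenMP is disabled, since
the pragmas are then ignored; the execution mode only matters for the offloaded
version.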

The implementation of the "guarded" execution mode follows roughly the
state machine description in [1], though the implementation is different
(more general) nowadays.


What we want
------------

Increase the amount of code executed in SPMD mode and the use of
lightweight "guarding" schemes where appropriate.


How we (could) get there
------------------------

We propose the following two modifications in order:

  1) Move the state machine logic into the OpenMP runtime library. That
     means in SPMD mode all device threads will start the execution of
     the user code, thus emerge from the runtime, while in guarded mode
     only the master will escape the runtime and the other threads will
     idle in their state machine code that is now just "hidden".

     Why:
     - The state machine code cannot be (reasonably) optimized anyway, so
       moving it into the library should not hurt run time and might even
       improve compile time a little bit.
     - The change should also simplify the Clang code generation, as we
       would generate structurally the same code for both execution modes;
       only the runtime library calls, or their arguments, would differ
       between them.
     - The reason we should not "just start in SPMD mode" and "repair" it
       later is simple: this way we always have semantically correct and
       executable code.
     - Finally, and most importantly, there is now only a small
       difference (see above) between the two modes in the code
       generated by Clang. If we later analyze the code to decide whether
       we can use SPMD mode instead of guarded mode, the analysis and
       transformation become much simpler.

The last item is wrong, unfortunately. A lot of things in the codegen depend on
the execution mode, e.g. correct support of data sharing. Of course, we can
try to generalize the codegen and rely completely on the runtime, but the
performance is going to be very poor.

We still need static analysis in the compiler. I agree that it is better to
move this analysis to the backend, at least after inlining, but at the
moment that is not possible. We need support for late outlining, which
will allow us to implement better detection of SPMD constructs and improve
performance.
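
Modification 1 above can be pictured with a small host-side model. This is a
sketch under strong assumptions: every name in it (`rt_init`, `rt_parallel`,
`rt_worker_loop`, `Team`) is hypothetical and not an existing runtime entry
point, device threads are simulated by a sequential loop, and "waking the
workers" is modelled as a queue of published regions. The point it tries to
show is that the "generated" code in `run_kernel` is identical for both modes;
only the mode value handed to the runtime differs.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// All names are hypothetical; this is a single-threaded model, not GPU code.
enum class ExecMode { SPMD, Guarded };

struct Team {
  int num_threads;
  ExecMode mode;
  // Parallel regions the master has published for the workers (guarded mode).
  std::vector<std::function<void(int)>> published;
};

// Returns true if the calling thread should run the user code itself.
bool rt_init(const Team &team, int tid) {
  return team.mode == ExecMode::SPMD || tid == 0;
}

// Entering a parallel region from user code.
void rt_parallel(Team &team, int tid, std::function<void(int)> body) {
  if (team.mode == ExecMode::Guarded)
    team.published.push_back(body);  // models waking the idle workers
  body(tid);
}

// The state machine that is now "hidden" in the runtime: workers execute
// whatever regions the master published.
void rt_worker_loop(const Team &team, int tid) {
  for (const auto &region : team.published)
    region(tid);
}

// Structurally the same code for both modes; only `mode` changes.
std::vector<int> run_kernel(ExecMode mode, int num_threads) {
  assert(num_threads > 0);
  Team team{num_threads, mode, {}};
  std::vector<int> hits(num_threads, 0);
  auto user_code = [&](int tid) {
    rt_parallel(team, tid, [&hits](int t) { hits[t] += 1; });
  };
  // Thread 0 goes first so its regions are published before workers look.
  for (int tid = 0; tid < num_threads; ++tid) {
    if (rt_init(team, tid))
      user_code(tid);
    else
      rt_worker_loop(team, tid);
  }
  return hits;
}
```

In both modes every simulated thread executes the parallel region exactly once.
The model says nothing about the data-sharing cost raised in the reply above; it
only illustrates the control-flow part of the proposal.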

 2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
    e.g., through the runtime library calls used, and that tries to
    convert it into the SPMD mode potentially by introducing lightweight
    guards in the process.

    Why:
    - After the inliner, and the canonicalizations, we have a clearer
      picture of the code that is actually executed in the target
      region and all the side effects it contains. Thus, we can make an
      educated decision on the required amount of guards that prevent
      unwanted side effects from happening after a move to SPMD mode.
    - At this point we can more easily introduce different schemes to
      avoid side effects by threads that were not supposed to run. We
      can decide if a state machine is needed, conditionals should be
      employed, masked instructions are appropriate, or "dummy" local
      storage can be used to hide the side effect from the outside
      world.
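
The "conditionals" variant of the guarding schemes listed above can be sketched
as follows. This is a host-side model under assumptions, not the proposed pass:
the names `KernelResult` and `run_spmdized` are made up, each loop iteration
stands in for one device thread, and the team-wide barrier a real device would
need between the two phases is only marked by a comment.

```cpp
#include <cassert>
#include <vector>

struct KernelResult {
  int counter;           // a side effect visible outside the team
  std::vector<int> out;  // per-thread result of the parallel work
};

// Models a region after conversion to SPMD mode: all threads now reach the
// formerly sequential code, so its side effect needs a lightweight guard.
KernelResult run_spmdized(int num_threads, bool guard_side_effects) {
  assert(num_threads > 0);
  KernelResult r{0, std::vector<int>(num_threads, 0)};

  // Phase 1: formerly master-only code, now executed by every thread.
  for (int tid = 0; tid < num_threads; ++tid) {
    if (!guard_side_effects || tid == 0)
      r.counter += 1;  // intended to happen exactly once
  }
  // (A team-wide barrier would sit here on a real device.)

  // Phase 2: the parallel region, executed by every thread in either mode.
  for (int tid = 0; tid < num_threads; ++tid)
    r.out[tid] = r.counter * 10 + tid;
  return r;
}
```

With the guard enabled the counter is incremented once, as in guarded mode;
without it every thread repeats the side effect, which is the kind of unwanted
behaviour the pass would have to rule out or guard against.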


None of this has been implemented yet, but we plan to start in the immediate
future. Any comments, ideas, or criticism are welcome!


Cheers,
  Johannes


P.S. [2-4] provide further information on implementation and features.

[1] https://ieeexplore.ieee.org/document/7069297
[2] https://dl.acm.org/citation.cfm?id=2833161
[3] https://dl.acm.org/citation.cfm?id=3018870
[4] https://dl.acm.org/citation.cfm?id=3148189




Alexey Bataev via llvm-dev

2019-Jan-22 19:10 UTC


[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"

But we need to know the execution mode, SPMD or "guarded".

-------------
Best regards,
Alexey Bataev


Doerfert, Johannes Rudolf via llvm-dev

2019-Jan-22 19:29 UTC


[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"

We would still know that. We can do exactly the same reasoning as we do now.

I think the important question is, how different is the code generated for
either mode and can we hide (most of) the differences in the runtime.


If I understand you correctly, you say the data sharing code looks very
different and the differences cannot be hidden, correct?

It would be helpful for me to understand your point if you could give me a piece
of OpenMP code for which the data sharing in SPMD mode and "guarded" mode are
as different as possible. I can compile it in both modes myself, so high-level
OpenMP is fine (I will disable SPMD mode manually in the source if necessary).


Thanks,

  Johannes



