thr3ads.net - llvm dev - [llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation" [Mar 2019]

Alexey Bataev via llvm-dev

2019-Mar-13 22:54 UTC

[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"

Johannes, did you try it on AMD GPUs? If not, I think it might be early to claim
it as a general interface for NVidia/AMD GPUs. I'm OK if you want to introduce
a base class for the GPU-specific codegen, but it must be done step by step and
thoroughly tested and reviewed. There might be some parts in common with the
NVPTX codegen. You can put the common functions into a base class and remove
them from the NVPTX implementation. But all this must be done in small parts.

Best regards,
Alexey Bataev

On March 13, 2019, at 17:49, Doerfert, Johannes <jdoerfert at anl.gov> wrote:


Alexey,


since we seem to disagree on the need for the changes I propose in general, as
well as on how they should be done, I would like to avoid spending more and more
time on this before it is clear where we are heading. Once that is clear, I'm
more than happy to split the patches even further and improve the test
coverage. I will then also welcome feedback from you and other reviewers on how
to split them.

With regard to the NVPTX support, I don't see your problem. We have NVPTX
support through the current code generation; adding this alternative and
developing it on the side will not change that.

Obviously, this stuff is not as well tested right now, but without putting it
out there it never will be. I will not be able to do the testing alone. Worst
case, there is no continuous involvement in this codegen/interface/optimization
scheme and we remove it again. Since it does not need modifications of any
existing code, that will not be a problem.


Best regards,

  Johannes




________________________________
From: Alexey Bataev <a.bataev at hotmail.com>
Sent: Wednesday, March 13, 2019 4:33:03 PM
To: Doerfert, Johannes
Cc: Alexey Bataev; cfe-dev at lists.llvm.org; openmp-dev at lists.llvm.org;
llvm-dev; Finkel, Hal J.
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

1. You don't need to implement everything in a single patch. Development is a
step-by-step process in which you commit things in small pieces. The code need
not be fully functional; you may start with some basic features. Currently it
is very hard to review.
2. I rather doubt that it can be reused without changes for AMD etc., especially
without being fully tested. The only tested target is NVPTX, and we need to
support it first. Later, we could extend it to AMD and other targets.
3. No, it is not incidental. It is thoroughly tested, at least.
4. Hm, if that were so, I would have just ignored it. Yes, I'm a bit sceptical,
but this is normal. The fact is that these patches violate the coding standard,
which suggests splitting patches into small pieces and committing them one by
one.

Best regards,
Alexey Bataev
> On March 13, 2019, at 17:18, Doerfert, Johannes <jdoerfert at anl.gov> wrote:
>
>> On 03/13, Alexey Bataev wrote:
>> On 13.03.2019 15:35, Doerfert, Johannes wrote:
>>>
>>> Hi Alexey,
>>>
>>>
>>> thank you for your quick feedback.
>>>
>>>
>>>> There are tooooooo(!) many changes, I don't know who's going to review
>>>> sooooo big patch.
>>>
>>>
>>> I can for sure split it into the three components/repositories that are
>>> touched: clang, llvm, and openmp.
>>>
>>> I feared it would then be harder to navigate the code in order to see
>>> the connection points.
>>>
>>> I am a bit amazed by your hyperbole, though, given that the complexity is
>>> not that high due to the absence of modified or removed lines. Anyway, you
>>> seem to have very strong feelings about this, so I am open to suggestions
>>> on how to split it up.
>>>
>>>
>>
>> 1. You definitely need to split it into separate patches for different
>> components.
>
> Done:
>  OpenMP: https://reviews.llvm.org/D59319
>   Clang: https://reviews.llvm.org/D59328
>    LLVM: https://reviews.llvm.org/D59331
>
>> 2. Even inside of those components this patch must be split into several
>> small patches, it is very hard to review so big patches.
>
> Please take a look at the three patches above.
>
> The first contains the interface definition and implementation for NVPTX
> (in cuda). I don't know how to further split that except to separate it
> into the definition and the implementation, though that does not make
> sense to me.
>
> The second contains the code generation. It is very much like the NVPTX
> code generation except that it does not contain logic.
>
> The third is the LLVM pass which could be split into two, SPMD-mode and
> state machine creation. I'll wait for feedback on the other patches
> until I go ahead.
>
>
>>>> Also, I don't like the idea of adding one more class for NVPTX
>>>> codegen. All your changes should be on top of the existing solution.
>>>
>>>
>>> Could you please explain to me why? This will only make everything
>>> more complicated and entangled.
>>> Also, the new class is supposed to be "target agnostic" so a new
>>> offloading target, e.g., AMD GPUs, could easily reuse
>>> the new code while the old code is sprinkled with NVPTX specific
>>> details, e.g., function calls, constants, etc.
>>>
>> 1. As far as I know, even now the NVPTX codegen can be reused for AMD
>> GPUs with some small changes.
>
> The target region code generation is supposed to be reusable for
> AMD/XYZ/... without changes.
>
>
>> 2. Your patch is about codegen for NVPTX, so you must change the
>> existing codegen, but not to introduce the new one for the same target.
>
> I strongly disagree. The patch is not "for NVPTX" but for "OpenMP target
> offloading", maybe with a focus on "GPU kernels". The fact that the only
> target offloading device we currently support is based on Cuda and NVPTX
> is incidental.
>
>
>> There is no point to maintain two different codegens for one target.
>
> Given your comments on my initial RFC and prototype I strongly suspected
> you do not want this approach to replace the current NVPTX code
> generation. Once that changes we can get rid of one of them.
>
>
>>>
>>> Thanks again,
>>>   Johannes
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Alexey Bataev <a.bataev at outlook.com>
>>> *Sent:* Wednesday, March 13, 2019 2:15:39 PM
>>> *To:* Doerfert, Johannes; cfe-dev at lists.llvm.org
>>> *Cc:* openmp-dev at lists.llvm.org; LLVM-Dev; Finkel, Hal J.; Alexey
>>> Bataev; Arpith Chacko Jacob
>>> *Subject:* Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>>>
>>>
>>> There are tooooooo(!) many changes, I don't know who's going to review
>>> sooooo big patch. You definitely need to split it into several smaller
>>> patches. Also, I don't like the idea of adding one more class for
>>> NVPTX codegen. All your changes should be on top of the existing
>>> solution.
>>>
>>> -------------
>>> Best regards,
>>> Alexey Bataev
>>> On 13.03.2019 15:08, Doerfert, Johannes wrote:
>>>> Please consider reviewing the code for the proposed approach here:
>>>>  https://reviews.llvm.org/D57460
>>>>
>>>> Initial tests, e.g., on the nw (needleman-wunsch) benchmark in the
>>>> rodinia 3.1 benchmark suite, showed 30% improvement after SPMD mode was
>>>> enabled automatically. The code in nw is conceptually equivalent to the
>>>> first example in the "to_SPMD_mode.ll" test case that can be found here:
>>>>  https://reviews.llvm.org/D57460#change-sBfg7kuN4Bid
>>>>
>>>> The implementation is missing key features but one should be able to see
>>>> the overall design by now. Once accepted, the missing features and more
>>>> optimizations will be added.
>>>>
>>>>
>>>>> On 01/22, Johannes Doerfert wrote:
>>>>> Where we are
>>>>> ------------
>>>>>
>>>>> Currently, when we generate OpenMP target offloading code for GPUs, we
>>>>> use sufficient syntactic criteria to decide between two execution modes:
>>>>>  1)      SPMD -- All target threads (in an OpenMP team) run all the code.
>>>>>  2) "Guarded" -- The master thread (of an OpenMP team) runs the user
>>>>>                  code. If an OpenMP distribute region is encountered, thus
>>>>>                  if all threads (in the OpenMP team) are supposed to
>>>>>                  execute the region, the master wakes up the idling
>>>>>                  worker threads and points them to the correct piece of
>>>>>                  code for distributed execution.
>>>>>
>>>>> For a variety of reasons we (generally) prefer the first execution mode.
>>>>> However, depending on the code, that might not be valid, or we might
>>>>> just not know if it is in the Clang code generation phase.
>>>>>
>>>>> The implementation of the "guarded" execution mode follows roughly the
>>>>> state machine description in [1], though the implementation is different
>>>>> (more general) nowadays.
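
To make the difference between the two modes concrete, here is a minimal
host-side C++ sketch. It is only a sequential simulation of one team, not
device code and not what Clang emits; ExecMode, RegionCounts, and
simulate_team are made-up names used for illustration. It counts how many
threads execute the sequential part of a target region and how many take
part in the parallel part.

```cpp
#include <cassert>

// Hypothetical model of one OpenMP team (illustration only).
enum class ExecMode { SPMD, Guarded };

struct RegionCounts {
  int sequential_runs = 0; // threads that executed the sequential user code
  int parallel_runs = 0;   // threads that took part in the parallel part
};

// Simulate a target region made of a sequential part followed by one
// parallel part, executed by a team of `num_threads` threads.
RegionCounts simulate_team(ExecMode mode, int num_threads) {
  assert(num_threads > 0);
  RegionCounts counts;
  for (int tid = 0; tid < num_threads; ++tid) {
    const bool is_master = (tid == 0);
    // SPMD: every thread runs all the code.
    // Guarded: only the master runs the sequential user code.
    if (mode == ExecMode::SPMD || is_master)
      ++counts.sequential_runs;
    // In both modes every thread executes the parallel part; in guarded
    // mode the workers idle until the master wakes them up for it.
    ++counts.parallel_runs;
  }
  return counts;
}
```

With four threads, the sequential part runs four times in SPMD mode and once
in guarded mode, which is why SPMD mode is only valid when repeating that
part has no observable side effects.
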
>>>>>
>>>>>
>>>>> What we want
>>>>> ------------
>>>>>
>>>>> Increase the amount of code executed in SPMD mode and the use of
>>>>> lightweight "guarding" schemes where appropriate.
>>>>>
>>>>>
>>>>> How we (could) get there
>>>>> ------------------------
>>>>>
>>>>> We propose the following two modifications in order:
>>>>>
>>>>>  1) Move the state machine logic into the OpenMP runtime library. That
>>>>>     means in SPMD mode all device threads will start the execution of
>>>>>     the user code, thus emerge from the runtime, while in guarded mode
>>>>>     only the master will escape the runtime and the other threads will
>>>>>     idle in their state machine code that is now just "hidden".
>>>>>
>>>>>     Why:
>>>>>     - The state machine code cannot be (reasonably) optimized anyway,
>>>>>       moving it into the library shouldn't hurt runtime but might even
>>>>>       improve compile time a little bit.
>>>>>     - The change should also simplify the Clang code generation as we
>>>>>       would generate structurally the same code for both execution modes
>>>>>       but only the runtime library calls, or their arguments, would
>>>>>       differ between them.
>>>>>     - The reason we should not "just start in SPMD mode" and "repair"
>>>>>       it later is simple, this way we always have semantically correct
>>>>>       and executable code.
>>>>>     - Finally, and most importantly, there is now only little
>>>>>       difference (see above) between the two modes in the code
>>>>>       generated by clang. If we later analyze the code trying to decide
>>>>>       if we can use SPMD mode instead of guarded mode the analysis and
>>>>>       transformation becomes much simpler.
>>>>>
>>>>>  2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
>>>>>     e.g., through the runtime library calls used, and that tries to
>>>>>     convert it into the SPMD mode potentially by introducing lightweight
>>>>>     guards in the process.
>>>>>
>>>>>     Why:
>>>>>     - After the inliner, and the canonicalizations, we have a clearer
>>>>>       picture of the code that is actually executed in the target
>>>>>       region and all the side effects it contains. Thus, we can make an
>>>>>       educated decision on the required amount of guards that prevent
>>>>>       unwanted side effects from happening after a move to SPMD mode.
>>>>>     - At this point we can more easily introduce different schemes to
>>>>>       avoid side effects by threads that were not supposed to run. We
>>>>>       can decide if a state machine is needed, conditionals should be
>>>>>       employed, masked instructions are appropriate, or "dummy" local
>>>>>       storage can be used to hide the side effect from the outside
>>>>>       world.
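
The first modification (state machine hidden in the runtime, structurally
identical user code in both modes) could look roughly like the following
host-side C++ sketch. Everything here is an assumption made for
illustration: TeamRuntime, init, parallel, worker_loop, and run_kernel are
hypothetical names, not the interface proposed in the patches, and the team
is simulated sequentially with the master (thread 0) running first.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for the device runtime (illustration only).
struct TeamRuntime {
  bool spmd_mode; // the only thing that differs between the two modes
  std::vector<std::function<void(int)>> published; // work handed to workers

  // Returns true if thread `tid` should leave the runtime and execute the
  // user code: every thread in SPMD mode, only the master in guarded mode.
  bool init(int tid) const { return spmd_mode || tid == 0; }

  // Called by the user code when it reaches a parallel region.
  void parallel(int tid, const std::function<void(int)> &body) {
    if (!spmd_mode)
      published.push_back(body); // master publishes the region for workers
    body(tid);                   // the calling thread executes it as well
  }

  // Worker "state machine", now hidden inside the runtime: pick up each
  // region the master published and execute it.
  void worker_loop(int tid) const {
    for (const auto &body : published)
      body(tid);
  }
};

// Returns how many times the parallel region body was executed.
int run_kernel(bool spmd_mode, int num_threads) {
  assert(num_threads > 0);
  TeamRuntime rt{spmd_mode, {}};
  int bodies_executed = 0;
  // The "generated" user code is the same for both modes.
  auto user_code = [&](int tid) {
    rt.parallel(tid, [&](int) { ++bodies_executed; });
  };
  for (int tid = 0; tid < num_threads; ++tid) {
    if (rt.init(tid))
      user_code(tid);      // thread emerges from the runtime
    else
      rt.worker_loop(tid); // guarded mode: worker stays in the runtime
  }
  return bodies_executed;
}
```

In both modes each of the simulated threads executes the parallel body once.
Only the flag passed to the runtime changes, which is what would let a later
pass switch modes by rewriting a call argument.
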
>>>>>
>>>>>
>>>>> None of this was implemented yet but we plan to start in the immediate
>>>>> future. Any comments, ideas, criticism is welcome!
>>>>>
>>>>>
>>>>> Cheers,
>>>>>  Johannes
>>>>>
>>>>>
>>>>> P.S. [2-4] Provide further information on implementation and features.
>>>>>
>>>>> [1] https://ieeexplore.ieee.org/document/7069297
>>>>> [2] https://dl.acm.org/citation.cfm?id=2833161
>>>>> [3] https://dl.acm.org/citation.cfm?id=3018870
>>>>> [4] https://dl.acm.org/citation.cfm?id=3148189
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Johannes Doerfert
>>>>> Researcher
>>>>>
>>>>> Argonne National Laboratory
>>>>> Lemont, IL 60439, USA
>>>>>
>>>>> jdoerfert at anl.gov
>
>
>
>
> --
>
> Johannes Doerfert
> Researcher
>
> Argonne National Laboratory
> Lemont, IL 60439, USA
>
> jdoerfert at anl.gov

Doerfert, Johannes via llvm-dev

2019-Mar-13 23:29 UTC

[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"

I get the part about "splitting and testing". I really do.

What I would like you _and others_ to tell me is if this general
approach, thus not the exact patches currently on phabricator, is what
we want to pursue. This is important because you seem to disagree on
this very basic question. You also seem to oppose every design decision,
especially those that make this approach free-standing.


On 03/13, Alexey Bataev wrote:
> Johannes, did you try it on AMD GPUs? If not, I think it might be
> early to claim it as a general interface for NVidia/AMD GPUs. I'm OK
> if you want to introduce a base class for the GPU-specific codegen,
> but it must be done step by step and thoroughly tested and reviewed.
> There might be some parts in common with the NVPTX codegen. You can put
> the common functions into a base class and remove them from the NVPTX
> implementation. But all this must be done in small parts.
> 
> Best regards,
> Alexey Bataev
> 
-- 

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

jdoerfert at anl.gov

Finkel, Hal J. via llvm-dev

2019-Mar-13 23:59 UTC

[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"

On 3/13/19 6:29 PM, Doerfert, Johannes wrote:
> I get the part about "splitting and testing". I really do.
>
> What I would like you _and others_ to tell me is if this general
> approach, thus not the exact patches currently on phabricator, is what
> we want to pursue. This is important because you seem to disagree on
> this very basic question. You also seem to oppose every design decision,
> especially those that make this approach free-standing.

I, for one, am in favor. Mutating this design such that we split the
responsibilities more naturally - making the optimizer responsible for
optimizing and the frontend responsible for representing semantics - is
the right future direction. I don't say that as a criticism of the work
that has been done, but we've learned over time of an increasing number
of places where optimization is useful, and now is the right time to start
the process of simplifying the frontend code generation for the
GPU-targeted OpenMP code while moving optimization logic to the IR-level
passes. This makes the frontend code simpler (thanks, Johannes, for
posting patches which seem to demonstrate that) and makes the
optimization logic more robust.

Regarding whether to build up the new logic in a separate file vs adding
changes to the existing frontend logic, I don't have a strong opinion.
If Alexey would prefer we add the new logic to the existing file because
that makes reviewing easier for him, I recommend that we do that. It
does have the advantage of keeping all of the similar logic together,
and that can make it easier to keep everything consistent when there are
changes.

Thanks again,

Hal

>
>
> On 03/13, Alexey Bataev wrote:
>> Johannes, did you try it on AMD GPUs? If not, I think it might be
>> early to claim it as a general interface for NVidia/AMD GPUs. I'm
ok,
>> if you want tointroduce a basic class for the GPU-specific codegen,
>> but it must be done step-by-step and thoroughly tested and reviewed.
>> Theremightbe some parts, common with NVPTX codegen. You can put the
>> commonfunctions into a base class and remove them from NVPTX
>> implementation. But all this must be done in small parts.
>>
>> Best regards,
>> Alexey Bataev
>>
>> 13 марта 2019 г., в 17:49, Doerfert, Johannes <jdoerfert at
anl.gov<mailto:jdoerfert at anl.gov>> написал(а):
>>
>>
>> Alexey,
>>
>>
>> since we seem to disagree on the need for changes as I propose them in
general, as well as how they should be done,
>>
>> I would like to avoid spending more and more time on this before it is
clear where we are heading. Once that is
>>
>> clear, I'm more than happy to split the patches even further and
improve the test coverage. I will then also welcome feedback
>>
>> from you and other reviewers on how to split them.
>>
>>
>> With regards to the NVPTX support, I don't see your problem. We
have NVPTX support through the current code generation,
>>
>> adding this alternative and developing it on the side will not change
that.
>>
>>
>> Obviously, this stuff is not as well tested right now but without
putting it out there it will never be. I will not be able to do the
>>
>> testing alone. Worst case, there is no continues involvement in this
codegen/interface/optimization scheme and we remove
>>
>> it again. Since it does need modifications of any existing code that
will not be a problem.
>>
>>
>> Best regards,
>>
>>   Johannes
>>
>>
>>
>>
>> ________________________________
>> From: Alexey Bataev <a.bataev at hotmail.com<mailto:a.bataev at
hotmail.com>>
>> Sent: Wednesday, March 13, 2019 4:33:03 PM
>> To: Doerfert, Johannes
>> Cc: Alexey Bataev; cfe-dev at lists.llvm.org<mailto:cfe-dev at
lists.llvm.org>; openmp-dev at lists.llvm.org<mailto:openmp-dev at
lists.llvm.org>; llvm-dev; Finkel, Hal J.
>> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>>
>> 1. You don't need to implement everything in a single patch. The
development process is a step-by-step process, when you commit something in
small pieces. The code must nit be fully functional, you may start from some
basic features. Currently it is very hard to review.
>> 2. I rather doubt that it can be reused without changes for AMD etc.,
especially without being fully tested. The only tested target is NVPTX and at
first we need to support it. Later, we could extend it to AMD and some other
targets.
>> 3. No, it is not incidental. It is thoroughly tested, at least.
>> 4. Hm, if it would be so, I would just ignored it. Yes, I'm a bit
sceptical, but this is normal. It is the fact that these patches break Coding
standard, which suggests to split patches into small pieces and commit them one
by one.
>>
>> Best regards,
>> Alexey Bataev
>>
>>> 13 марта 2019 г., в 17:18, Doerfert, Johannes <jdoerfert at
anl.gov<mailto:jdoerfert at anl.gov>> написал(а):
>>>
>>>> On 03/13, Alexey Bataev wrote:
>>>> 13.03.2019 15:35, Doerfert, Johannes пишет:
>>>>> Hi Alexey,
>>>>>
>>>>>
>>>>> thank you for your quick feedback.
>>>>>
>>>>>
>>>>>> There are tooooooo(!) many changes, I don't know who's going to review
>>>>>> sooooo big a patch.
>>>>>
>>>>>
>>>>> I can for sure split it into the three components/repositories that are
>>>>> touched: clang, llvm, and openmp.
>>>>>
>>>>> I feared it would then be harder to navigate the code in order to see
>>>>> the connection points.
>>>>>
>>>>> I am a bit amazed by your hyperbole, though, given that the complexity is
>>>>> not that high due to the absence of modified or removed lines. Anyway,
>>>>> you seem to have very strong feelings about this, so I am open to
>>>>> suggestions on how to split it up.
>>>>>
>>>>>
>>>> 1. You definitely need to split it into separate patches for different
>>>> components.
>>> Done:
>>>  OpenMP: https://reviews.llvm.org/D59319
>>>   Clang: https://reviews.llvm.org/D59328
>>>    LLVM: https://reviews.llvm.org/D59331
>>>
>>>> 2. Even inside of those components this patch must be split into several
>>>> small patches; such big patches are very hard to review.
>>> Please take a look at the three patches above.
>>>
>>> The first contains the interface definition and implementation for NVPTX
>>> (in CUDA). I don't know how to further split that except to separate it
>>> into the definition and the implementation, though that does not make
>>> sense to me.
>>>
>>> The second contains the code generation. It is very much like the NVPTX
>>> code generation except that it does not contain logic.
>>>
>>> The third is the LLVM pass, which could be split into two: SPMD-mode and
>>> state machine creation. I'll wait for feedback on the other patches
>>> before I go ahead.
>>>
>>>
>>>>>> Also, I don't like the idea of adding one more class for NVPTX
>>>>>> codegen. All your changes should be on top of the existing solution.
>>>>>
>>>>>
>>>>> Could you please explain to me why? This will only make everything
>>>>> more complicated and entangled.
>>>>> Also, the new class is supposed to be "target agnostic" so a new
>>>>> offloading target, e.g., AMD GPUs, could easily reuse
>>>>> the new code while the old code is sprinkled with NVPTX-specific
>>>>> details, e.g., function calls, constants, etc.
>>>>>
>>>> 1. As far as I know, even now the NVPTX codegen can be reused for AMD
>>>> GPUs with some small changes.
>>> The target region code generation is supposed to be reusable for
>>> AMD/XYZ/... without changes.
>>>
>>>
>>>> 2. Your patch is about codegen for NVPTX, so you must change the
>>>> existing codegen rather than introduce a new one for the same target.
>>> I strongly disagree. The patch is not "for NVPTX" but for "OpenMP target
>>> offloading", maybe with a focus on "GPU kernels". The fact that the only
>>> target offloading device we currently support is based on CUDA and NVPTX
>>> is incidental.
>>>
>>>
>>>> There is no point in maintaining two different codegens for one target.
>>> Given your comments on my initial RFC and prototype, I strongly suspected
>>> you do not want this approach to replace the current NVPTX code
>>> generation. Once that changes we can get rid of one of them.
>>>
>>>
>>>>> Thanks again,
>>>>>   Johannes
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* Alexey Bataev <a.bataev at outlook.com>
>>>>> *Sent:* Wednesday, March 13, 2019 2:15:39 PM
>>>>> *To:* Doerfert, Johannes; cfe-dev at lists.llvm.org
>>>>> *Cc:* openmp-dev at lists.llvm.org; LLVM-Dev; Finkel, Hal J.; Alexey
>>>>> Bataev; Arpith Chacko Jacob
>>>>> *Subject:* Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>>>>>
>>>>>
>>>>> There are tooooooo(!) many changes, I don't know who's going to review
>>>>> sooooo big a patch. You definitely need to split it into several smaller
>>>>> patches. Also, I don't like the idea of adding one more class for
>>>>> NVPTX codegen. All your changes should be on top of the existing
>>>>> solution.
>>>>>
>>>>> -------------
>>>>> Best regards,
>>>>> Alexey Bataev
>>>>> On 13.03.2019 at 15:08, Doerfert, Johannes wrote:
>>>>>> Please consider reviewing the code for the proposed approach here:
>>>>>>  https://reviews.llvm.org/D57460
>>>>>>
>>>>>> Initial tests, e.g., on the nw (needleman-wunsch) benchmark in the
>>>>>> rodinia 3.1 benchmark suite, showed 30% improvement after SPMD mode was
>>>>>> enabled automatically. The code in nw is conceptually equivalent to the
>>>>>> first example in the "to_SPMD_mode.ll" test case that can be found here:
>>>>>>  https://reviews.llvm.org/D57460#change-sBfg7kuN4Bid
>>>>>>
>>>>>> The implementation is missing key features but one should be able to see
>>>>>> the overall design by now. Once accepted, the missing features and more
>>>>>> optimizations will be added.
>>>>>>
>>>>>>
>>>>>>> On 01/22, Johannes Doerfert wrote:
>>>>>>> Where we are
>>>>>>> ------------
>>>>>>>
>>>>>>> Currently, when we generate OpenMP target offloading code for GPUs, we
>>>>>>> use sufficient syntactic criteria to decide between two execution modes:
>>>>>>>  1)      SPMD -- All target threads (in an OpenMP team) run all the code.
>>>>>>>  2) "Guarded" -- The master thread (of an OpenMP team) runs the user
>>>>>>>                  code. If an OpenMP distribute region is encountered, thus
>>>>>>>                  if all threads (in the OpenMP team) are supposed to
>>>>>>>                  execute the region, the master wakes up the idling
>>>>>>>                  worker threads and points them to the correct piece of
>>>>>>>                  code for distributed execution.
>>>>>>>
>>>>>>> For a variety of reasons we (generally) prefer the first execution mode.
>>>>>>> However, depending on the code, that might not be valid, or we might
>>>>>>> just not know if it is in the Clang code generation phase.
>>>>>>>
>>>>>>> The implementation of the "guarded" execution mode follows roughly the
>>>>>>> state machine description in [1], though the implementation is different
>>>>>>> (more general) nowadays.
>>>>>>>
>>>>>>>
>>>>>>> What we want
>>>>>>> ------------
>>>>>>>
>>>>>>> Increase the amount of code executed in SPMD mode and the use of
>>>>>>> lightweight "guarding" schemes where appropriate.
>>>>>>>
>>>>>>>
>>>>>>> How we (could) get there
>>>>>>> ------------------------
>>>>>>>
>>>>>>> We propose the following two modifications in order:
>>>>>>>
>>>>>>>  1) Move the state machine logic into the OpenMP runtime library. That
>>>>>>>     means in SPMD mode all device threads will start the execution of
>>>>>>>     the user code, thus emerge from the runtime, while in guarded mode
>>>>>>>     only the master will escape the runtime and the other threads will
>>>>>>>     idle in their state machine code that is now just "hidden".
>>>>>>>
>>>>>>>     Why:
>>>>>>>     - The state machine code cannot be (reasonably) optimized anyway,
>>>>>>>       moving it into the library shouldn't hurt runtime but might even
>>>>>>>       improve compile time a little bit.
>>>>>>>     - The change should also simplify the Clang code generation as we
>>>>>>>       would generate structurally the same code for both execution modes
>>>>>>>       but only the runtime library calls, or their arguments, would
>>>>>>>       differ between them.
>>>>>>>     - The reason we should not "just start in SPMD mode" and "repair"
>>>>>>>       it later is simple, this way we always have semantically correct
>>>>>>>       and executable code.
>>>>>>>     - Finally, and most importantly, there is now only little
>>>>>>>       difference (see above) between the two modes in the code
>>>>>>>       generated by clang. If we later analyze the code trying to decide
>>>>>>>       if we can use SPMD mode instead of guarded mode the analysis and
>>>>>>>       transformation becomes much simpler.
>>>>>>>
>>>>>>>  2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
>>>>>>>     e.g., through the runtime library calls used, and that tries to
>>>>>>>     convert it into the SPMD mode potentially by introducing lightweight
>>>>>>>     guards in the process.
>>>>>>>
>>>>>>>     Why:
>>>>>>>     - After the inliner, and the canonicalizations, we have a clearer
>>>>>>>       picture of the code that is actually executed in the target
>>>>>>>       region and all the side effects it contains. Thus, we can make an
>>>>>>>       educated decision on the required amount of guards that prevent
>>>>>>>       unwanted side effects from happening after a move to SPMD mode.
>>>>>>>     - At this point we can more easily introduce different schemes to
>>>>>>>       avoid side effects by threads that were not supposed to run. We
>>>>>>>       can decide if a state machine is needed, conditionals should be
>>>>>>>       employed, masked instructions are appropriate, or "dummy" local
>>>>>>>       storage can be used to hide the side effect from the outside
>>>>>>>       world.
>>>>>>>
>>>>>>>
>>>>>>> None of this has been implemented yet, but we plan to start in the
>>>>>>> immediate future. Any comments, ideas, or criticism are welcome!
>>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>>  Johannes
>>>>>>>
>>>>>>>
>>>>>>> P.S. [2-4] provide further information on implementation and features.
>>>>>>>
>>>>>>> [1] https://ieeexplore.ieee.org/document/7069297
>>>>>>> [2] https://dl.acm.org/citation.cfm?id=2833161
>>>>>>> [3] https://dl.acm.org/citation.cfm?id=3018870
>>>>>>> [4] https://dl.acm.org/citation.cfm?id=3148189
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Johannes Doerfert
>>>>>>> Researcher
>>>>>>>
>>>>>>> Argonne National Laboratory
>>>>>>> Lemont, IL 60439, USA
>>>>>>>
>>>>>>> jdoerfert at anl.gov
>>>
>>>
>>>
>>> --
>>>
>>> Johannes Doerfert
>>> Researcher
>>>
>>> Argonne National Laboratory
>>> Lemont, IL 60439, USA
>>>
>>> jdoerfert at anl.gov
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
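
For readers following the thread, the two execution modes described in the quoted RFC can be modeled on the CPU. This is only a sketch under stated assumptions: it is plain Python, not GPU code, and the names `Mode`, `Trace`, and `run_team` are hypothetical and do not come from the patches or the OpenMP runtime. It counts how many threads of one team execute the sequential part of a target region and how many execute the distributed region. The third mode stands in for the "lightweight guarding" idea, where SPMD execution is kept but a side-effecting sequential statement is restricted to one thread.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    GUARDED = "guarded"           # master runs user code, workers idle in a state machine
    SPMD = "spmd"                 # every thread runs all the code
    SPMD_GUARDED = "spmd+guards"  # SPMD, but sequential side effects sit behind a guard


@dataclass
class Trace:
    sequential_runs: int = 0  # threads that executed the sequential part
    worker_wakeups: int = 0   # workers the master had to wake via the state machine
    region_runs: int = 0      # threads that executed the distributed region


def run_team(mode: Mode, team_size: int) -> Trace:
    """Model one team executing: sequential code, then one distributed region."""
    trace = Trace()
    for tid in range(team_size):
        is_master = tid == 0
        # Sequential part: in plain SPMD mode every thread executes it, which is
        # only valid if it has no side effects; otherwise only the master does.
        if mode is Mode.SPMD or is_master:
            trace.sequential_runs += 1
        # Guarded mode: workers were idle and must be woken for the region.
        if mode is Mode.GUARDED and not is_master:
            trace.worker_wakeups += 1
        # Distributed region: all threads of the team take part in every mode.
        trace.region_runs += 1
    return trace


if __name__ == "__main__":
    for mode in Mode:
        print(mode.value, run_team(mode, team_size=4))
```

In this model the guarded mode and the guarded SPMD variant execute the sequential part the same number of times; they differ only in whether the workers pass through the state machine, which is the overhead the RFC aims to avoid.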

Alexey Bataev via llvm-dev

2019-Mar-14 00:31 UTC

[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"

No, I'm just saying let's start with one particular target, NVPTX, and
then, later, when the NVPTX part becomes stable, we can try to generalize it.
The runtime part is written in CUDA, not HIP for AMD.
Also, as I said, we can try to generalize the interface of the compiler support.
But you need to merge the common parts between the existing NVPTX runtime class and
your new one. And you need to do it step by step.

Best regards,
Alexey Bataev
> On March 13, 2019, at 19:29, Doerfert, Johannes <jdoerfert at anl.gov> wrote:
> 
> I get the part about "splitting and testing". I really do.
> 
> What I would like you _and others_ to tell me is if this general
> approach, thus not the exact patches currently on phabricator, is what
> we want to pursue. This is important because you seem to disagree on
> this very basic question. You also seem to oppose every design decision,
> especially those that make this approach free-standing.
> 
> 
> 
> -- 
> 
> Johannes Doerfert
> Researcher
> 
> Argonne National Laboratory
> Lemont, IL 60439, USA
> 
> jdoerfert at anl.gov
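
The late "SPMD-zation" step discussed in this thread (the second modification of the RFC) can be sketched as a toy transformation. Everything here is an assumption made for illustration: the kernel is a list of hypothetical `Inst` records rather than LLVM-IR, the runtime entry points `rt_init(guarded)` and `rt_init(spmd)` are invented names, and the sketch always converts, whereas the RFC says a real pass would decide case by case whether guards, a state machine, or another scheme is appropriate.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Inst:
    text: str
    has_side_effect: bool = False     # e.g., a store visible to other threads
    in_parallel_region: bool = False  # already meant to run on all threads
    guarded: bool = False             # True once restricted to a single thread


@dataclass
class Kernel:
    init_call: str    # hypothetical runtime call that selects the execution mode
    body: List[Inst]


def spmdize(kernel: Kernel) -> Kernel:
    """Convert a guarded-mode kernel to SPMD mode, guarding sequential side effects."""
    # Detect the guarded mode through the runtime call used to start the kernel.
    if kernel.init_call != "rt_init(guarded)":
        return kernel  # already SPMD (or unknown): leave untouched
    new_body = [
        # A side effect outside a parallel region must still happen only once,
        # so it gets a lightweight guard instead of the full state machine.
        replace(inst, guarded=True)
        if inst.has_side_effect and not inst.in_parallel_region
        else inst
        for inst in kernel.body
    ]
    return Kernel("rt_init(spmd)", new_body)


example = Kernel(
    "rt_init(guarded)",
    [
        Inst("x = a + b"),
        Inst("store x, out", has_side_effect=True),
        Inst("work(i)", has_side_effect=True, in_parallel_region=True),
    ],
)

if __name__ == "__main__":
    for inst in spmdize(example).body:
        print(("[guard] " if inst.guarded else "        ") + inst.text)
```

Only the sequential store receives a guard; the side-effect-free computation may be replicated across threads, and the parallel-region work is left as it was.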
