Gheorghe-Teod Bercea via llvm-dev
2019-Jan-31 15:05 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
Hi Johannes,

Thank you for the explanation.

I think we need to clarify some details about code generation in Clang today:

1. non-SPMD mode, or generic mode, uses the master-worker code gen scheme
where the master thread and the worker threads are disjoint sets of threads
(when one set runs the other set is blocked and doesn't participate in the
execution):

    workers | master
    ====================
    BLOCKED | RUNNING
    ------- sync -------
    RUNNING | BLOCKED
    ------- sync -------
    BLOCKED | RUNNING

2. the worker threads, in their RUNNING state above, contain a state machine
which chooses the parallel region to be executed. Today this choice happens
in one of two ways: explicit targets (where you know which outlined region
you are calling and you just call it) and indirect targets (via a function
pointer set by the master thread in one of its RUNNING regions):

    workers | master
    ====================
    BLOCKED | RUNNING
    ------- sync -------
    RUNNING |
     state  | BLOCKED
    machine |
    ------- sync -------
    BLOCKED | RUNNING
Your intended changes (only target the RUNNING state machine of the
WORKERS):
- remove explicit targets from current code gen. (by itself this is a
  major step back!!)
- introduce a pass in LLVM which will add back the explicit targets.

Can you point out any major improvements this will bring compared to the
current state?
From your answer below you mention a lower number of function calls. Since
today we inline everything anyway, how does that help?
If you haven't considered performance so far, how come you're proposing
all these changes? What led you to propose them?
In SPMD mode all threads execute the same code. Using the notation in the
schemes above you can depict this as:

       all threads
    ====================
        RUNNING

No state machine being used, no disjoint sets of threads. This is as if
you're executing CUDA code.
Could you explain what your proposed changes are in this context?
Could you also explain what you mean by "assuming SPMD wasn't achieved"?
Do you expect to write another LLVM pass which will transform the
master-worker scheme + state machine into an SPMD scheme?

Thanks,

--Doru
From: "Doerfert, Johannes" <jdoerfert@anl.gov>
To: Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea@ibm.com>
Cc: Alexey Bataev <a.bataev@outlook.com>, Guray Ozen <gozen@nvidia.com>,
    "Gregory.Rodgers@amd.com" <Gregory.Rodgers@amd.com>, "Finkel, Hal J."
    <hfinkel@anl.gov>, "kli@ca.ibm.com" <kli@ca.ibm.com>,
    "openmp-dev@lists.llvm.org" <openmp-dev@lists.llvm.org>,
    LLVM-Dev <llvm-dev@lists.llvm.org>,
    "cfe-dev@lists.llvm.org" <cfe-dev@lists.llvm.org>
Date: 01/30/2019 07:56 PM
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
Hi Doru,

[+ llvm-dev and cfe-dev]

On 01/30, Gheorghe-Teod Bercea wrote:
> Hi Johannes,
>
> First of all thanks for looking into the matter of improving non-SPMD mode!
>
> I have a question regarding the state machine that you said you'd like to
> replace/improve. There are cases (such as target regions that span
> multiple compilation units) where the switch statement is required. Is
> this something that your changes will touch in any way?

There will not be a difference. Let me explain in some detail as there
seems to be a lot of confusion on this state machine topic:

Now:

Build a state machine in the user code (module) with all the parallel
regions as explicit targets of the switch statement and a fallback
default that does an indirect call to the requested parallel region.

Proposed, after Clang:

Use the runtime state machine implementation [0] which reduces the
switch to the default case, thus an indirect call to the requested
parallel region. This will always work, regardless of the translation
unit that contained the parallel region (pointer).

Proposed, after the OpenMP-Opt pass in LLVM (assuming SPMD wasn't achieved):

All reachable parallel regions in a kernel are collected and used to
create the switch statement in the user code (module) [1, line 111] with
a fallback if there are potentially [1, line 212] hidden parallel
regions.

Does that make sense?

[0] https://reviews.llvm.org/D57460#change-e9Ljd9RgdWYz
[1] https://reviews.llvm.org/D57460#change-8gnnGNfJVR4B
> My next question is, for the workloads which are in the same compilation
> unit there is a trick that code gen performs (or could perform, I'm not
> sure if this has been upstreamed) where it can check for the specific
> name of an outlined function and then just call it directly, thus making
> that function inline-able (thus erasing most if not all the overhead of
> having the state machine in the first place). In other words the "worst"
> part of the switch statement will only apply to outlined functions from
> other compilation units. With this in mind what would the impact of your
> changes be in the end? If this part isn't clear I can do some digging to
> find out how this actually works in more detail; it's been too long
> since I've had to look at this part.

See the answer above.

> Can you share some performance numbers given an example you have been
> looking at? I see you have one that uses "#pragma omp atomic". I would
> avoid using something like that since it may have other overheads not
> related to your changes. I would put together an example with this
> directive structure:
>
> #pragma omp target teams distribute
> for(...){
>   <code1>
>   #pragma omp parallel for
>   for(...) {
>     <code2>
>   }
>   <code3>
> }
>
> which forces the use of the master-worker scheme (non-SPMD mode) without
> any other distractions.

The atomic stuff I used to determine correctness. I haven't yet looked
at performance. I will do so now and inform you of my results.

> It would then be interesting to understand how you plan to change the
> LLVM code generated for this,

The examples show how the LLVM-IR is supposed to look, right?

> what the overheads that you're targeting are (register usage,
> synchronization cost etc), and then what the performance gain is
> compared to the current scheme.

I can also compare register usage in addition to performance, but there
is no difference in synchronization. The number and (relative) order of
original runtime library calls stays the same. The number of user code
-> runtime library calls is even decreased.

Please let me know if this helps and what questions remain.

Thanks,
  Johannes

> From: "Doerfert, Johannes" <jdoerfert@anl.gov>
> To: Alexey Bataev <a.bataev@outlook.com>
> Cc: Guray Ozen <gozen@nvidia.com>, Gheorghe-Teod Bercea
>     <gheorghe-teod.bercea@ibm.com>, "openmp-dev@lists.llvm.org"
>     <openmp-dev@lists.llvm.org>, "Finkel, Hal J." <hfinkel@anl.gov>,
>     "Gregory.Rodgers@amd.com" <Gregory.Rodgers@amd.com>,
>     "kli@ca.ibm.com" <kli@ca.ibm.com>
> Date: 01/30/2019 04:14 PM
> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
> ________________________________
>
> I don't really see "many ifs and maybes", actually none.
>
> Anyway, I will now work on a patch set that adds the new functionality
> under a cmd flag in order to showcase correctness and performance on
> real code.
>
> If you, or somebody else, have interesting examples, please feel free
> to point me at them.
>
> Thanks,
>   Johannes
>
> ________________________________
>
> From: Alexey Bataev <a.bataev@outlook.com>
> Sent: Wednesday, January 30, 2019 2:18:19 PM
> To: Doerfert, Johannes
> Cc: Guray Ozen; Gheorghe-Teod Bercea; openmp-dev@lists.llvm.org;
>     Finkel, Hal J.; Gregory.Rodgers@amd.com; kli@ca.ibm.com
> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>
> Currently, there are too many "if"s and "maybe"s. If you can provide a
> solution that does not break anything and does not affect the
> performance, does not require changes in the backend - then go ahead
> with the patches.
>
> -------------
> Best regards,
> Alexey Bataev
>
> 30.01.2019 14:49, Doerfert, Johannes:
> No, SPMD mode will not be affected at all.
>
> The "worse" part is the following:
> If we inline runtime library calls before the openmp-opt pass had a
> chance to look at the code, we will not have a customized state machine
> for the __non-SPMD__ case. That is, the if-cascade checking the work
> function pointer is not there.
>
> Avoiding this potential performance decline is actually very easy.
> While we do not have the "inline_late" capability, run the openmp-opt
> pass __before__ the inliner and we will not get "worse" code. We might
> however miss out on _new_ non-SPMD -> SPMD transformations.
>
> Does that make sense?
>
> ________________________________
>
> From: Alexey Bataev <a.bataev@outlook.com>
> Sent: Wednesday, January 30, 2019 1:44:10 PM
> To: Doerfert, Johannes
> Cc: Guray Ozen; Gheorghe-Teod Bercea; openmp-dev@lists.llvm.org;
>     Finkel, Hal J.; Gregory.Rodgers@amd.com; kli@ca.ibm.com
> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>
> Any "worse" is not a good idea. We need to avoid it. It would be good
> that the new code did not affect the performance, especially for SPMD
> mode (I think this "worse" will affect exactly SPMD mode, no?)
>
> -------------
> Best regards,
> Alexey Bataev
>
> 30.01.2019 14:38, Doerfert, Johannes:
> The LLVM optimization (openmp-opt), which does non-SPMD -> SPMD and
> custom state machine generation, will not fire if the
> __kernel_general_... calls are "missing". Thus if we inline "too
> early", we are "stuck" with the non-SPMD choice (not worse than what
> we have now!) and the default library state machine ("worse" than what
> we have right now). Does that make sense?
>
> The second option described what I want to see us do "later" in order
> to avoid the above scenario and always get both, openmp-opt and
> inlining of the runtime and work functions.
>
> ________________________________
>
> From: Alexey Bataev <a.bataev@outlook.com>
> Sent: Wednesday, January 30, 2019 1:25:42 PM
> To: Doerfert, Johannes
> Cc: Guray Ozen; Gheorghe-Teod Bercea; openmp-dev@lists.llvm.org;
>     Finkel, Hal J.; Gregory.Rodgers@amd.com; kli@ca.ibm.com
> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>
> Sorry, did not understand your answer correctly. But you wrote:
> for now, not doing the optimization is just fine.
> What do you mean?
>
> -------------
> Best regards,
> Alexey Bataev
>
> 30.01.2019 14:23, Doerfert, Johannes:
> Alexey,
>
> I'm not sure how to interpret "Bad idea!", but I think there is again
> a misunderstanding. To help me understand, could you try to elaborate
> a bit?
>
> To make my last email clear:
> I __do__ want inlining. Both answers to your earlier inlining
> questions do actually assume the runtime library calls __are
> eventually inlined__; that is why I mentioned LTO and the runtime as
> bitcode.
>
> Cheers,
>   Johannes
>
> ________________________________
>
> From: Alexey Bataev <a.bataev@outlook.com>
> Sent: Wednesday, January 30, 2019 1:14:56 PM
> To: Doerfert, Johannes
> Cc: Guray Ozen; Gheorghe-Teod Bercea; openmp-dev@lists.llvm.org;
>     Finkel, Hal J.; Gregory.Rodgers@amd.com; kli@ca.ibm.com
> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>
> Bad idea!
>
> -------------
> Best regards,
> Alexey Bataev
>
> 30.01.2019 14:11, Doerfert, Johannes:
> Sure I do. Why do you think I don't?
>
> ________________________________
>
> From: Alexey Bataev <a.bataev@outlook.com>
> Sent: Wednesday, January 30, 2019 1:00:59 PM
> To: Doerfert, Johannes
> Cc: Guray Ozen; Gheorghe-Teod Bercea; openmp-dev@lists.llvm.org;
>     Finkel, Hal J.; Gregory.Rodgers@amd.com; kli@ca.ibm.com
> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>
> You don't want to do the inlining?
>
> -------------
> Best regards,
> Alexey Bataev
>
> 30.01.2019 13:59, Doerfert, Johannes:
> - for now, not doing the optimization is just fine. The whole idea is
> that code is always valid.

--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

jdoerfert@anl.gov
[attachment "signature.asc" deleted by Gheorghe-Teod Bercea/US/IBM]
Doerfert, Johannes via llvm-dev
2019-Jan-31 17:34 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
Hi Doru,

maybe I should clarify something I mentioned in an earlier email already,
but it seems there are things getting lost in this thread:

While the prototype replaces code generation parts in Clang, the actual
patches will add alternative code generation paths, guarded under a cmd
flag. Once, and obviously only if, everything is in place and has been
shown to improve the current situation, the default path would be
switched.

On 01/31, Gheorghe-Teod Bercea wrote:
> Hi Johannes,
>
> Thank you for the explanation.
>
> I think we need to clarify some details about code generation in Clang
> today:

I'm not really sure why you feel the need to do that, but OK.

> 1. non-SPMD mode, or generic mode, uses the master-worker code gen
> scheme where the master thread and the worker threads are disjoint sets
> of threads (when one set runs the other set is blocked and doesn't
> participate in the execution):
>
> workers | master
> ====================
> BLOCKED | RUNNING
> ------- sync -------
> RUNNING | BLOCKED
> ------- sync -------
> BLOCKED | RUNNING

I agree, and for the record, this is not changed by my prototype, see
[1, line 295].

[1] https://reviews.llvm.org/D57460#change-e9Ljd9RgdWYz

> 2. the worker threads, in their RUNNING state above, contain a state
> machine which chooses the parallel region to be executed. Today this
> choice happens in one of two ways: explicit targets (where you know
> which outlined region you are calling and you just call it) and
> indirect targets (via a function pointer set by the master thread in
> one of its RUNNING regions):
>
> workers | master
> ====================
> BLOCKED | RUNNING
> ------- sync -------
> RUNNING |
>  state  | BLOCKED
> machine |
> ------- sync -------
> BLOCKED | RUNNING

Partially agreed. Afaik, it will always be decided through a function
pointer set by the master thread and communicated to the workers through
the runtime.
The workers use a switch, or in fact an if-cascade, to check if the
function pointer points to a known parallel region. If so, it will be
called directly; otherwise there is the fallback indirect call of the
function pointer.

> Your intended changes (only target the RUNNING state machine of the
> WORKERS):
> - remove explicit targets from current code gen. (by itself this is a
>   major step back!!)
> - introduce a pass in LLVM which will add back the explicit targets.

Simplified but correct. From my perspective this is not a problem
because in production I will always run the LLVM passes after Clang.
Even if you do not run the LLVM passes, the below reasoning might be
enough to convince people to run a similar pass in their respective
pipeline. If that is not enough, we can also keep the Clang state
machine generation around (see the top comment).

> Can you point out any major improvements this will bring compared to
> the current state?

Sure, I'll give you three for now:

[FIRST]

Here is the original motivation from the first RFC mail (in case you
have missed it):

  2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
     e.g., through the runtime library calls used, and that tries to
     convert it into the SPMD mode, potentially by introducing
     lightweight guards in the process.

  Why:
  - After the inliner, the canonicalizations, dead code elimination, and
    code movement [2, Section 7 on page 8], we have a clearer picture of
    the code that is actually executed in the target region and all the
    side effects it contains. Thus, we can make an educated decision on
    the required amount of guards that prevent unwanted side effects
    from happening after a move to SPMD mode.
  - At this point we can more easily introduce different schemes to
    avoid side effects by threads that were not supposed to run.
    We can decide if a state machine is needed, conditionals should be
    employed, masked instructions are appropriate, or "dummy" local
    storage can be used to hide the side effect from the outside world.

[2] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf

Let me give you the canonical example that shows the need for this:

  #pragma omp target teams
  {
    foo(i + 0)
    foo(i + 1)
    foo(i + 2)
  }

  void foo(int i) {
  #pragma omp parallel
    ...
  }

The target region can be executed in SPMD mode, but we cannot decide
that syntactically when the region is encountered. Agreed?

[SECOND]

Now there are other benefits with regard to the above mentioned state
machine. In the LLVM pass we can analyze the kernel code
interprocedurally and detect all potentially executed parallel regions,
together with a relation between them, and the need for the fallback
case. That means we can build a state machine that __takes control
dependences into account__, __after inlining and dead code elimination__
canonicalized the kernel.

If inlining and code canonicalization resulted in the following
structure, the state machine we can build late can know that after
section0 the workers will execute section1, potentially multiple times,
before they move on to section3. In today's scheme, this is sth. we
cannot simply do, causing us to traverse the if-cascade from top to
bottom all the time (which grows linearly with the number of parallel
regions).

  if (...) {
    #pragma omp parallel
    section0(...)
    do {
      #pragma omp parallel
      section1(...)
    } while (...)
  }
  #pragma omp parallel
  section3(...)

[THIRD]

Depending on the hardware, we need to make sure, or at least try really
hard, that there is no fallback case in the state machine, which is an
indirect function call. This can be done best at link time, which
requires us to analyze the kernel late and modify the state machine at
that point anyway.

> From your answer below you mention a lower number of function calls.
> Since today we inline everything anyway, how does that help?

If we inline, it doesn't, for performance purposes. If we do not inline,
it does. In either case, it helps to simplify middle-end analyses and
transformations that work on kernels. Finally, it prevents us from
wasting compile time looking at the (unoptimizable) state machine of
every target region.

Maybe it is worth asking the opposite question: What are the reasons
against these general runtime calls that hide the complexity we
currently emit into the user code module? [Note that I discuss the only
drawback I came up with, a non-customized state machine, already above.]

> If you haven't considered performance so far, how come you're proposing
> all these changes? What led you to propose all these changes?

See above.

> In SPMD mode all threads execute the same code. Using the notation in
> the schemes above you can depict this as:
>
>      all threads
> ====================
>      RUNNING
>
> No state machine being used, no disjoint sets of threads. This is as
> if you're executing CUDA code.

Agreed.

> Could you explain what your proposed changes are in this context?

None; at least after inlining the runtime library calls there is
literally the same code executed before and after the changes.

> Could you also explain what you mean by "assuming SPMD wasn't achieved"?

That is one of the two motivations for the whole change. I explained
that in the initial RFC and again above. The next comment points you to
the code that tries to achieve SPMD mode for inputs that were generated
in the non-SPMD mode (master-worker + state machine) by Clang.

> Do you expect to write another LLVM pass which will transform the
> master-worker scheme + state machine into an SPMD scheme?

I did already, as that was the main motivation for the whole thing. It
is part of the prototype, see [3, line 321].
[3] https://reviews.llvm.org/D57460#change-8gnnGNfJVR4B

Cheers,
  Johannes

--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

jdoerfert@anl.gov
Gheorghe-Teod Bercea via llvm-dev
2019-Feb-06 15:49 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
Hi Johannes,

Your clarifications helped a lot; having all the details gathered in one
place helped me understand better what you are proposing.

Thanks a lot for taking the time to explain.

Thanks,

--Doru



From: "Doerfert, Johannes" <jdoerfert@anl.gov>
To: Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea@ibm.com>
Cc: Alexey Bataev <a.bataev@outlook.com>, "cfe-dev@lists.llvm.org"
<cfe-dev@lists.llvm.org>, Guray Ozen <gozen@nvidia.com>,
"Gregory.Rodgers@amd.com" <Gregory.Rodgers@amd.com>, "Finkel, Hal J."
<hfinkel@anl.gov>, "kli@ca.ibm.com" <kli@ca.ibm.com>, LLVM-Dev
<llvm-dev@lists.llvm.org>, "openmp-dev@lists.llvm.org"
<openmp-dev@lists.llvm.org>
Date: 01/31/2019 12:34 PM
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
________________________________________
Hi Doru,

maybe I should clarify something I mentioned in an earlier email already
but it seems there are things getting lost in this thread:

  While the prototype replaces code generation parts in Clang, the
  actual patches will add alternative code generation paths, guarded
  under a cmd flag. Once, and obviously only if, everything is in place
  and has been shown to improve the current situation, the default path
  would be switched.


On 01/31, Gheorghe-Teod Bercea wrote:
> Hi Johannes,
>
> Thank you for the explanation.
>
> I think we need to clarify some details about code generation in Clang
> today:

I'm not really sure why you feel the need to do that but OK.


> 1. non-SPMD mode, or generic mode, uses the master-worker code gen
> scheme where the master thread and the worker threads are disjoint sets
> of threads (when one set runs, the other set is blocked and doesn't
> participate in the execution):
>
> workers | master
> ====================
> BLOCKED | RUNNING
> ------- sync -------
> RUNNING | BLOCKED
> ------- sync -------
> BLOCKED | RUNNING

I agree, and for the record, this is not changed by my prototype, see
[1, line 295].

[1] https://reviews.llvm.org/D57460#change-e9Ljd9RgdWYz


> 2. the worker threads, in their RUNNING state above, contain a state
> machine which chooses the parallel region to be executed. Today this
> choice happens in one of two ways: explicit targets (where you know
> what outlined region you are calling and you just call it) and indirect
> targets (via a function pointer set by the master thread in one of its
> RUNNING regions):
>
> workers | master
> ====================
> BLOCKED | RUNNING
> ------- sync -------
> RUNNING |
>  state  | BLOCKED
> machine |
> ------- sync -------
> BLOCKED | RUNNING

Partially agreed. Afaik, it will always be decided through a function
pointer set by the master thread and communicated to the workers through
the runtime. The workers use a switch, or in fact an if-cascade, to
check if the function pointer points to a known parallel region. If so,
it will be called directly; otherwise there is the fallback indirect
call of the function pointer.
WORKERS):<br>> - remove explicit targets from current code gen. (by
itself this is
a major<br>> step back!!)<br>> - introduce a pass in LLVM
which will add back the explicit targets.<br><br>Simplified but
correct. From my perspective this is not a problem<br>because in
production I will always run the LLVM passes after Clang.<br>Even if you
do not run the LLVM passes, the below reasoning might be<br>enough to
convince people to run a similar pass in their respective<br>pipeline. If
that is not enough, we can also keep the Clang state<br>machine generation
around (see the top comment).<br><br><br>> Can you point
out any major improvements this will bring compared
to the<br>> current state?<br><br>Sure, I'll give you
three for now:<br><br>[FIRST]<br>Here is the original
motivation from the first RFC mail (in case you<br>have missed
it):<br><br> 2) Implement a middle-end LLVM-IR pass that detects the
guarded mode,<br> e.g., through the runtime library calls used, and
that tries
to<br> convert it into the SPMD mode potentially by introducing
lightweight<br> guards in the process.<br><br>
Why:<br> - After the inliner, the canonicalizations, dead code
elimination,<br> code movement [2, Section 7 on page 8], we have a
clearer picture<br> of the code that is actually executed in the
target
region and all<br> the side effects it contains. Thus, we can make an
educated<br> decision on the required amount of guards that prevent
unwanted<br> side effects from happening after a move to SPMD
mode.<br> - At this point we can more easily introduce different
schemes
to<br> avoid side effects by threads that were not supposed
to run. We<br> can decide if a state machine is needed, conditionals
should be<br> employed, masked instructions are appropriate, or
"dummy" local<br> storage can be used to hide the side
effect from the
outside<br> world.<br><br>[2]
</font></tt><a
href="http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf"><tt><font
size=2>http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf</font></tt></a><tt><font
size=2><br><br><br>Let me give you the canonical example
that shows the need for this:<br><br> #pragma omp target
teams<br> {<br> foo(i + 0)<br> foo(i + 1)<br>
foo(i + 2)<br> }<br><br> void foo(int i) {<br>
#pragma omp parallel<br> ...<br> }<br><br>The target
region can be executed in SPMD mode but we cannot decide
that<br>syntactically when the region is encountered.
Agreed?<br><br><br><br>[SECOND]<br>Now there are
other benefits with regards to the above mentioned state<br>machine. In
the LLVM pass we can analyze the kernel code<br>interprocedurally and
detect all potentially executed parallel regions,<br>together with a
relation between them, and the need for the fallback<br>case. That means
we can build a state machine that __takes control<br>dependences into
account__, __after inlining and dead code elimination__<br>canonicalized
the kernel.<br><br>If inlining and code canonicalization resulted in
the following<br>structure, the state machine we can build late can know
that after<br>section0 the workers will execute section1, potentially
multiple times,<br>before they move on to section3. In today's scheme,
this is sth. we<br>cannot simply do, causing us to traverse the if-cascade
from top to<br>bottom all the time (which grows linear with the number of
parallel<br>regions).<br><br> if (...) {<br> #pragma
omp parallel<br> section0(...)<br> do {<br> #pragma
omp parallel<br> section1(...)<br> } while (...)<br>
}<br> #pragma omp parallel<br>
section3(...)<br><br><br><br>[THIRD]<br>Depending
[THIRD]
Depending on the hardware, we need to make sure, or at least try really
hard, that there is no fallback case in the state machine, which is an
indirect function call. This can be done best at link time, which
requires us to analyze the kernel late and modify the state machine at
that point anyway.


> From your answer below you mention a lower number of function calls.
> Since today we inline everything anyway how does that help?

If we inline, it doesn't for performance purposes. If we do not inline,
it does. In either case, it helps to simplify middle-end analyses and
transformations that work on kernels. Finally, it prevents us from
wasting compile time looking at the (unoptimizable) state machine of
every target region.

Maybe it is worth asking the opposite question:
  What are the reasons against these general runtime calls that hide
  the complexity we currently emit into the user code module?
[Note that I discuss the only drawback I came up with, a non-customized
state machine, already above.]


> If you haven't considered performance so far how come you're proposing
> all these changes? What led you to propose all these changes?

See above.


> In SPMD mode all threads execute the same code. Using the notation in
> the schemes above you can depict this as:
>
> all threads
> ====================
> RUNNING
>
> No state machine being used, no disjoint sets of threads. This is as
> if you're executing CUDA code.

Agreed.


> Could you explain what your proposed changes are in this context?

None; at least after inlining the runtime library calls, there is
literally the same code executed before and after the changes.


> Could you also explain what you mean by "assuming SPMD wasn't
> achieved"?

That is one of the two motivations for the whole change. I explained
that in the initial RFC and again above. The next comment points you to
the code that tries to achieve SPMD mode for inputs that were generated
in the non-SPMD mode (master-worker + state machine) by Clang.


> Do you expect to write another LLVM pass which will transform the
> master-worker scheme + state machine into an SPMD scheme?

I did already, as that was the main motivation for the whole thing.
It is part of the prototype, see [3, line 321].

[3] https://reviews.llvm.org/D57460#change-8gnnGNfJVR4B


Cheers,
  Johannes


> From: "Doerfert, Johannes" <jdoerfert@anl.gov>
> To: Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea@ibm.com>
> Cc: Alexey Bataev <a.bataev@outlook.com>, Guray Ozen <gozen@nvidia.com>,
> "Gregory.Rodgers@amd.com" <Gregory.Rodgers@amd.com>, "Finkel, Hal J."
> <hfinkel@anl.gov>, "kli@ca.ibm.com" <kli@ca.ibm.com>,
> "openmp-dev@lists.llvm.org" <openmp-dev@lists.llvm.org>, LLVM-Dev
> <llvm-dev@lists.llvm.org>, "cfe-dev@lists.llvm.org"
> <cfe-dev@lists.llvm.org>
> Date: 01/30/2019 07:56 PM
> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
> ________________________________________
>
> Hi Doru,
>
> [+ llvm-dev and cfe-dev]
>
> On 01/30, Gheorghe-Teod Bercea wrote:
> > Hi Johannes,
> >
> > First of all thanks for looking into the matter of improving non-SPMD
> > mode!
> >
> > I have a question regarding the state machine that you said you'd
> > like to replace/improve. There are cases (such as target regions that
> > span multiple compilation units) where the switch statement is
> > required. Is this something that your changes will touch in any way?
>
> There will not be a difference. Let me explain in some detail as there
> seems to be a lot of confusion on this state machine topic:
>
> Now:
>
> Build a state machine in the user code (module) with all the parallel
> regions as explicit targets of the switch statement and a fallback
> default that does an indirect call to the requested parallel region.
>
>
> Proposed, after Clang:
>
> Use the runtime state machine implementation [0] which reduces the
> switch to the default case, thus an indirect call to the requested
> parallel region. This will always work, regardless of the translation
> unit that contained the parallel region (pointer).
>
>
> Proposed, after the OpenMP-Opt pass in LLVM (assuming SPMD wasn't
> achieved):
>
> All reachable parallel regions in a kernel are collected and used to
> create the switch statement in the user code (module) [1, line 111]
> with a fallback if there are potentially [1, line 212] hidden parallel
> regions.
>
>
> Does that make sense?
>
>
> [0] https://reviews.llvm.org/D57460#change-e9Ljd9RgdWYz
> [1] https://reviews.llvm.org/D57460#change-8gnnGNfJVR4B
> > My next question is, for the workloads which are in the same
> > compilation unit there is a trick that code gen performs (or could
> > perform, I'm not sure if this has been upstreamed) where it can check
> > for the specific name of an outlined function and then just call it
> > directly, thus making that function inline-able (thus erasing most if
> > not all of the overhead of having the state machine in the first
> > place). In other words the "worst" part of the switch statement will
> > only apply to outlined functions from other compilation units. With
> > this in mind what would the impact of your changes be in the end? If
> > this part isn't clear I can do some digging to find out how this
> > actually works in more detail; it's been too long since I've had to
> > look at this part.
>
> See the answer above.
>
>
> > Can you share some performance numbers given an example you have been
> > looking at? I see you have one that uses "#pragma omp atomic". I
> > would avoid using something like that since it may have other
> > overheads not related to your changes. I would put together an
> > example with this directive structure:
> >
> > #pragma omp target teams distribute
> > for(...){
> >   <code1>
> >   #pragma omp parallel for
> >   for(...) {
> >     <code2>
> >   }
> >   <code3>
> > }
> >
> > which forces the use of the master-worker scheme (non-SPMD mode)
> > without any other distractions.
>
> The atomic stuff I used to determine correctness. I haven't yet looked
> at performance. I will do so now and inform you of my results.
>
>
> > It would then be interesting to understand how you plan to change the
> > LLVM code generated for this,
>
> The examples show how the LLVM-IR is supposed to look, right?
>
> > what the overheads that you're targeting are (register usage,
> > synchronization cost etc), and then what the performance gain is
> > compared to the current scheme.
>
> I can also compare register usage in addition to performance, but there
> is no difference in synchronization. The number and (relative) order of
> original runtime library calls stays the same. The number of user code
> -> runtime library calls is even decreased.
>
>
> Please let me know if this helps and what questions remain.
>
> Thanks,
>   Johannes
>
>
> > From: "Doerfert, Johannes" <jdoerfert@anl.gov>
> > To: Alexey Bataev <a.bataev@outlook.com>
> > Cc: Guray Ozen <gozen@nvidia.com>, Gheorghe-Teod Bercea
> > <gheorghe-teod.bercea@ibm.com>, "openmp-dev@lists.llvm.org"
> > <openmp-dev@lists.llvm.org>, "Finkel, Hal J." <hfinkel@anl.gov>,
> > "Gregory.Rodgers@amd.com" <Gregory.Rodgers@amd.com>, "kli@ca.ibm.com"
> > <kli@ca.ibm.com>
> > Date: 01/30/2019 04:14 PM
> > Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
> > ________________________________________
> >
> > I don't really see "many ifs and maybes", actually none.
> >
> > Anyway, I will now work on a patch set that adds the new
> > functionality under a cmd flag in order to showcase correctness and
> > performance on real code.
> >
> > If you, or somebody else, have interesting examples, please feel free
> > to point me at them.
> >
> > Thanks,
> > Johannes
>
> --
>
> Johannes Doerfert
> Researcher
>
> Argonne National Laboratory
> Lemont, IL 60439, USA
>
> jdoerfert@anl.gov
> [attachment "signature.asc" deleted by Gheorghe-Teod Bercea/US/IBM]

--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

jdoerfert@anl.gov
[attachment "signature.asc" deleted by Gheorghe-Teod Bercea/US/IBM]
</font></tt><br><br><BR>