Adve, Vikram Sadanand via llvm-dev
2017-Jan-19 19:36 UTC
[llvm-dev] [RFC] IR-level Region Annotations
Hi Johannes,

> I am especially curious where you get your data from. Tapir [0] (and to
> some degree PIR [1]) have shown that, counterintuitively, only a few changes
> to LLVM passes are needed. Tapir was recently used in an MIT class with a
> lot of students and it seemed to work well with only minimal changes
> to analysis and especially transformation passes.

TAPIR is an elegant, small extension and, in particular, I think the idea of asymmetric parallel tasks and control flow is a clever way to express parallelism with serial semantics, as in Cilk. Encoding the control flow extensions as explicit instructions is orthogonal to that, though arguably more elegant than using region tags + metadata.

However, Cilk is a tiny language compared with the full complexity of other languages, like OpenMP. To take just one example, TAPIR cannot express the ORDERED construct of OpenMP. A more serious concern, IMO, is that TAPIR (like Cilk) requires serial semantics, whereas there are many parallel languages, OpenMP included, that do not obey that restriction. Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are needed because without them you'd be dependent on fundamentally hard compiler analyses to extract the same information for satisfactory parallel performance; realistic applications cannot depend on the success of such analyses. For example, automatic array privatization in parallel loops is a very hard problem (automatic scalar privatization is easier, but even that is interprocedural). Reduction recognition is doable for common cases, but there are hard cases here as well. These are all standard features of parallel programs, not specific to OpenMP (e.g., C++17 parallel template operators are likely to produce these as well).

If you support all these capabilities in the IR, a *lot* more than 6000 LOC (Tapir's statistic; I'm sorry I don't recall the number for PIR) would probably have to be modified in LLVM.

>> [...]
>> (b) Add several new LLVM instructions such as, for parallelism, fork, spawn,
>> join, barrier, etc.
>> [...]
>
> For me fork and spawn are serving the same purpose; most new schemes suggested three new instructions in total.

A reasonable question is whether to use (#b) first-class instructions for some features, *in combination with* (#d) — i.e., region markers + metadata — or to use #d exclusively. There are almost certainly far too many essential features in parallel programs to capture them all as new instructions. I don't see a need to answer this question on Day 1. Instead, we can begin with regions and metadata annotations, and then "promote" a few features to first-class instructions if the benefit is justified.

Does that make sense?
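To make the starting point concrete, here is a rough sketch of the kind of region + metadata encoding I have in mind. The intrinsic and metadata names below are invented for illustration (they do not exist in trunk, and they are not necessarily the spelling the RFC proposes); the shape is what matters:

    ; Hypothetical region markers: a token ties the exit to its entry,
    ; and metadata carries the directive plus its clauses.
    define void @foo(i32* %a, i64 %n) {
    entry:
      %r = call token @llvm.experimental.region.entry(), !region !0
      call void @body(i32* %a, i64 %n)
      call void @llvm.experimental.region.exit(token %r)
      ret void
    }

    declare token @llvm.experimental.region.entry()
    declare void @llvm.experimental.region.exit(token)
    declare void @body(i32*, i64)

    !0 = !{!"omp.parallel.for", !"reduction.add", !"firstprivate"}

The point is that clauses like REDUCTION and FIRSTPRIVATE ride along as annotations the frontend already knows, instead of being rediscovered by analysis, and "promoting" the control flow to real instructions later would not invalidate any of it.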
> Also I think we already have a lot of
> simple constructs in the IR to express high-level information properly,
> e.g. 'atomicrmw' instructions for high-level reduction. While we currently
> lack analysis passes to extract information from such low-level
> representations, these are certainly possible [2,3]. I would argue that such
> analyses are a better way to do things than placing "high-level
> intrinsics" in the IR to mark things like reductions.

IMO, this is too fragile for ensuring performance for realistic applications. While there is a lot of work on reduction recognition in the literature going back to the 1980s, there will always be cases where these algorithms miss a reduction pattern and you get terrible parallel performance. Second, I'm not convinced that you can use specific operations like "atomicrmw" to find candidate loop nests in the first place that *might* be reductions, because the atomic op may in fact be in a library operation linked in as binary code. Third, this greatly increases the complexity of a "first working prototype" for a language like OpenMP because you can't exclude REDUCTION clauses. Fourth, I believe array privatization is even harder; more generally, you cannot rely on heroic compiler optimizations to take the place of explicit language directives.
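To illustrate the second point, here is a contrived sketch (invented names, not from any real code base) where the atomic update hides behind a call boundary, so a recognizer scanning the loop body never sees an atomicrmw at all:

    ; The accumulation lives inside @accumulate, which may exist only as
    ; binary code in a linked library; nothing in this loop body marks it
    ; as a reduction candidate.
    declare void @accumulate(i32*, i32)

    define void @sum(i32* %acc, i32* %a, i64 %n) {
    entry:
      %empty = icmp eq i64 %n, 0
      br i1 %empty, label %exit, label %loop
    loop:
      %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
      %p = getelementptr inbounds i32, i32* %a, i64 %i
      %v = load i32, i32* %p
      call void @accumulate(i32* %acc, i32 %v)   ; atomicrmw add is in here
      %i.next = add nuw i64 %i, 1
      %again = icmp ult i64 %i.next, %n
      br i1 %again, label %loop, label %exit
    exit:
      ret void
    }

A REDUCTION clause tells the compiler directly what no amount of scanning this function can recover.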
I may be showing my age, but I co-wrote a full HPF-to-MPI source-to-source compiler in the late 1990s at Rice University with some quite sophisticated optimizations [1,2], and we and others in the HPF compiler community — both product and research teams — encountered many cases where compiler analyses were inadequate. There is a credible view that HPF failed despite very high hopes and significant industry and research investment because it relied too much on heroic compiler technology.

[1] http://vadve.web.engr.illinois.edu/Papers/set_framework.pldi98.ps (PLDI 1998)
[2] http://vadve.web.engr.illinois.edu/Papers/SC98.NASBenchmarksStudy/dhpfsc98.pdf (SC 1998)

--Vikram

// Vikram S. Adve
// Professor, Department of Computer Science
// University of Illinois at Urbana-Champaign
// vadve at illinois.edu
// http://llvm.org

> On Jan 19, 2017, at 11:20 AM, via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Date: Thu, 19 Jan 2017 16:30:51 +0100
> From: Johannes Doerfert via llvm-dev <llvm-dev at lists.llvm.org>
> To: Hal Finkel <hfinkel at anl.gov>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>, Simon Moll <moll at cs.uni-saarland.de>
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
> Message-ID: <20170119153050.GB26909 at arch-linux-jd.home>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Hal, Hi Xinmin,
>
> First let me thank you for pushing in this direction; it means more
> people are interested in some kind of change here.
>
> While "our" RFC will be sent out next week, I want to comment on a specific
> point of this one right now:
>
>> [...]
>> (b) Add several new LLVM instructions such as, for parallelism, fork, spawn,
>> join, barrier, etc.
>> [...]
>
> For me fork and spawn are serving the same purpose; most new schemes suggested three new instructions in total.
>
>> Options | Pros                | Cons
>> [...]
>> (b)     | Parallelism becomes | Huge effort for extending all LLVM passes and
>>         | first class citizen | code generation to support new instructions.
>>         |                     | A large set of information still needs to be
>>         |                     | represented using other means.
>> [...]
>
> I am especially curious where you get your data from. Tapir [0] (and to
> some degree PIR [1]) have shown that, counterintuitively, only a few changes
> to LLVM passes are needed. Tapir was recently used in an MIT class with a
> lot of students and it seemed to work well with only minimal changes
> to analysis and especially transformation passes.
>
> Also the "code generation" issues you mention do, in my opinion, apply to
> _all_ proposed schemes, as all have to be lowered to either sequential
> code or parallel library runtime calls eventually. The sequentialization
> for our new Tapir/PIR hybrid has less than 50 lines, and generating
> parallel runtime calls will probably be similar in all schemes anyway.
>
> Regarding the last point, "A large set of information still needs to be
> represented using other means", I am curious why this is a bad thing. I
> think the IR should be simple; each instruction/intrinsic etc. should have
> clear and minimal semantics. Also I think we already have a lot of
> simple constructs in the IR to express high-level information properly,
> e.g. 'atomicrmw' instructions for high-level reduction. While we currently
> lack analysis passes to extract information from such low-level
> representations, these are certainly possible [2,3]. I would argue that such
> analyses are a better way to do things than placing "high-level
> intrinsics" in the IR to mark things like reductions.
>
> Cheers,
>   Johannes, on behalf of the Tapir and PIR team
>
> [0] https://cpc2016.infor.uva.es/wp-content/uploads/2016/06/CPC2016_paper_12.pdf
> [1] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf
> [2] Section 3 in https://arxiv.org/pdf/1505.07716
> [3] Section 3.2 and 3.3 in https://www.st.cs.uni-saarland.de/publications/files/streit-taco-2015.pdf
Mehdi Amini via llvm-dev
2017-Jan-19 19:46 UTC
[llvm-dev] [RFC] IR-level Region Annotations
> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hi Johannes,
>
> <snip>
>
> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are
> needed because without them you'd be dependent on fundamentally hard
> compiler analyses to extract the same information for satisfactory parallel
> performance; realistic applications cannot depend on the success of such
> analyses.

I agree with this, but I'm also wondering if it needs to be first class in the IR?

For example, we know our alias analysis is very basic, and C/C++ have a higher constraint thanks to their type system, but we didn't inject this higher-level information that helps the optimizer as first-class IR constructs. I wonder if the same wouldn't apply to an OpenMP reduction clause, for instance, where you could use the "basic" IR construct and the analysis would use metadata emitted by the frontend instead of trying to infer the reduction.

Just a thought; I haven't given much time to studying other constructs and how they map to the IR :)
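For instance, something along these lines (a strawman: the !parallel.reduction name is invented, nothing like it exists today):

    ; The frontend, which knows the clause, tags the atomic update; an
    ; analysis keys off the tag instead of pattern-matching the loop.
    %old = atomicrmw add i32* %acc, i32 %v monotonic, !parallel.reduction !0

    !0 = !{!"omp.reduction", !"add"}

The usual caveat applies: passes may drop metadata they don't understand, so only performance, never correctness, could depend on the tag.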
> A reasonable question is whether to use (#b) first-class instructions for
> some features, *in combination with* (#d) — i.e., region markers + metadata —
> or to use #d exclusively. There are almost certainly far too many essential
> features in parallel programs to capture them all as new instructions. I
> don't see a need to answer this question on Day 1. Instead, we can begin
> with regions and metadata annotations, and then "promote" a few features to
> first-class instructions if the benefit is justified.
>
> Does that make sense?

Now that I read this, I wonder if it isn't close to what I tried to express above :)

—
Mehdi

> <snip>
Daniel Berlin via llvm-dev
2017-Jan-19 20:04 UTC
[llvm-dev] [RFC] IR-level Region Annotations
On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> <snip>
>
> I agree with this, but I'm also wondering if it needs to be first class in the IR?
> For example, we know our alias analysis is very basic, and C/C++ have a
> higher constraint thanks to their type system, but we didn't inject this
> higher-level information that helps the optimizer as first-class IR
> constructs.

FWIW, while I agree with the general point, I wouldn't use this example, because we pretty much still suffer to this day from that choice (both in AA, and devirt, and ...) :)

We can't always even tell fields apart.
Adve, Vikram Sadanand via llvm-dev
2017-Jan-19 20:37 UTC
[llvm-dev] [RFC] IR-level Region Annotations
> On Jan 19, 2017, at 1:46 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> <snip>
>>
>> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are
>> needed because without them you'd be dependent on fundamentally hard
>> compiler analyses to extract the same information for satisfactory parallel
>> performance; realistic applications cannot depend on the success of such
>> analyses.
>
> I agree with this, but I'm also wondering if it needs to be first class in the IR?
> For example, we know our alias analysis is very basic, and C/C++ have a
> higher constraint thanks to their type system, but we didn't inject this
> higher-level information that helps the optimizer as first-class IR
> constructs.
> I wonder if the same wouldn't apply to an OpenMP reduction clause, for
> instance, where you could use the "basic" IR construct and the analysis
> would use metadata emitted by the frontend instead of trying to infer the
> reduction.

As Danny said (next message), the lack of strong AA does limit some of our optimizations today. Even beyond that, however, parallel performance is a different beast from scalar optimizations. If you don't devirtualize a subset of indirect calls, you may take a few percent hit. If you fail to recognize a reduction and run the loop serially instead of in parallel, you could easily take a 10x or greater slowdown on that loop, leading to a substantial slowdown overall, depending on the importance of the reduction. Put another way, individual parallel optimizations (whether done manually or by a compiler) generally have far larger performance effects than individual scalar optimizations.
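To put a rough number on that asymmetry (illustrative figures of mine, not measurements), model the missed reduction with Amdahl's law, with f the fraction of serial runtime spent in that loop and s the speedup achieved by everything else:

    \[ S_{\mathrm{overall}} = \frac{1}{\,f + (1-f)/s\,} \]

With f = 0.5 and s = 16 (say, 16 cores), the overall speedup is 1/(0.5 + 0.5/16), roughly 1.88x, versus 16x had the reduction been parallelized as well: about an 8.5x gap from one missed pattern, against the few-percent cost of a missed devirtualization.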
> Just a thought; I haven't given much time to studying other constructs and
> how they map to the IR :)

>> A reasonable question is whether to use (#b) first-class instructions for
>> some features, *in combination with* (#d) — i.e., region markers + metadata —
>> or to use #d exclusively. There are almost certainly far too many essential
>> features in parallel programs to capture them all as new instructions. I
>> don't see a need to answer this question on Day 1. Instead, we can begin
>> with regions and metadata annotations, and then "promote" a few features to
>> first-class instructions if the benefit is justified.
>>
>> Does that make sense?
>
> Now that I read this, I wonder if it isn't close to what I tried to express above :)

:-).

--Vikram

// Vikram S. Adve
// Professor, Department of Computer Science
// University of Illinois at Urbana-Champaign
// vadve at illinois.edu
// http://llvm.org

> <snip>
Johannes Doerfert via llvm-dev
2017-Jan-20 12:43 UTC
[llvm-dev] [RFC] IR-level Region Annotations
Hi Vikram,

Thanks for the elaborate answer. I will try to be brief with my comments.

On 01/19, Adve, Vikram Sadanand via llvm-dev wrote:
> TAPIR is an elegant, small extension and, in particular, I think the
> idea of asymmetric parallel tasks and control flow is a clever way to
> express parallelism with serial semantics, as in Cilk. Encoding the
> control flow extensions as explicit instructions is orthogonal to
> that, though arguably more elegant than using region tags + metadata.

I agree that any form of first-class parallelism seems more elegant than region tags + metadata. Choosing a clever structure for the parallel constructs can additionally save a lot of work wrt. existing passes; e.g., the dominance relationship will prevent most illegal transformations in our proposed fork-join scheme.

> However, Cilk is a tiny language compared with the full complexity of
> other languages, like OpenMP. To take just one example, TAPIR cannot
> express the ORDERED construct of OpenMP.

As with reductions, ORDERED is something we might want to express with low-level constructs we already have, e.g., atomics. Alternatively, the fork instructions can be extended with "attributes" to express more elaborate schemes, e.g., the "lockstep" annotation we introduced for the PIR white paper.

> A more serious concern, IMO, is that TAPIR (like Cilk) requires serial
> semantics, whereas there are many parallel languages, OpenMP included,
> that do not obey that restriction.

Serial semantics are good to have but do not need to be enforced. In PIR we used a "force" attribute to force parallel semantics and disable potential serialization, though OpenMP does not necessarily enforce parallelism.

> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that
> are needed because without them you'd be dependent on fundamentally
> hard compiler analyses to extract the same information for satisfactory
> parallel performance [...]

In the end it boils down to how much of this information we want/need to encode as special cases (intrinsics/metadata) in the IR and what information should be recovered by analysis from constructs built with common IR. Additionally, one has to consider which representation is easier to deal with and optimize in the existing passes.

In the PIR white paper we showed how OpenMP (FIRST)PRIVATE clauses can be expressed using only well-placed alloca instructions. From there, the analysis is not nearly as hard as automatic parallelization.
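As a minimal sketch of that encoding (my illustration here, not a verbatim excerpt from the white paper; @task_body is an invented name for a frontend-outlined task):

    ; %x.orig is the shared variable; an alloca plus copy-in inside the
    ; task realizes FIRSTPRIVATE(x) with ordinary IR.
    define void @task_body(i32* %x.orig) {
    entry:
      %x.priv = alloca i32               ; private storage for this task
      %init = load i32, i32* %x.orig
      store i32 %init, i32* %x.priv      ; copy-in: FIRSTPRIVATE semantics
      ; ... the task uses only %x.priv; PRIVATE(x) would simply omit
      ; the copy-in ...
      ret void
    }

Every existing memory analysis already understands allocas, which is exactly the point.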
> If you support all these capabilities in the IR, a *lot* more than
> 6000 LOC (Tapir's statistic; I'm sorry I don't recall the number for
> PIR) would probably have to be modified in LLVM.

All in all, TAPIR was implemented in ~5k LOC. Of that, more than 1k is explicit parallel optimization, 1k adds the new instructions (thus mechanical), and 2k lowers the parallelism (thus needed for basically all schemes); the rest makes it work with existing analysis and transformation passes. Given that breakdown, I do not see how any other scheme would require much less.

> A reasonable question is whether to use (#b) first-class instructions
> for some features, *in combination with* (#d) — i.e., region markers +
> metadata — or to use #d exclusively. There are almost certainly far
> too many essential features in parallel programs to capture them all
> as new instructions. I don't see a need to answer this question on Day
> 1. Instead, we can begin with regions and metadata annotations, and
> then "promote" a few features to first-class instructions if the
> benefit is justified.
>
> Does that make sense?

Given parallelism as a first-class citizen we can still add new features in various ways, including intrinsics or metadata. However, once we walk down a path it is hard to change direction completely, especially since region annotations and fork-join parallel regions serve overlapping needs.

> IMO, this is too fragile for ensuring performance for realistic
> applications. While there is a lot of work on reduction recognition in
> the literature going back to the 1980s, there will always be cases
> where these algorithms miss a reduction pattern and you get terrible
> parallel performance.

This might be true, but it is also a problem for current optimizations, e.g., loop vectorization. Missed optimization opportunities will not go away one way or another.

> Second, I'm not convinced that you can use specific operations like
> "atomicrmw" to find candidate loop nests in the first place that
> *might* be reductions, because the atomic op may in fact be in a
> library operation linked in as binary code.

I do not follow the library-operation comment, as 'atomicrmw' is an IR instruction. Also, the frontend would place the atomicrmw instructions in a parallel loop, so there is no finding of candidate loop nests involved, right? (I have the feeling you are mixing automatic parallelization with the representation of existing parallelism here.)

> Third, this greatly increases the complexity of a "first working
> prototype" for a language like OpenMP because you can't exclude
> REDUCTION clauses. Fourth, I believe array privatization is even
> harder; more generally, you cannot rely on heroic compiler
> optimizations to take the place of explicit language directives.

As above, it is not about automatic parallelization of a sequential program but about expressing parallel constructs and reasoning about them afterwards. The latter is a different and often much easier problem.
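To make the contrast explicit, a sketch (invented names; a plain function stands in for the outlined parallel region, since PIR's fork/join instructions are not in trunk): when the frontend lowers a reduction clause itself, the atomicrmw sits directly in a loop that is parallel by construction, so nothing has to be discovered:

    ; Chunk of a frontend-outlined parallel loop; the region is known to
    ; be parallel, and the reduction update is explicit in the body.
    define void @chunk(i32* %sum, i32* %a, i64 %lo, i64 %hi) {
    entry:
      %any = icmp ult i64 %lo, %hi
      br i1 %any, label %loop, label %exit
    loop:
      %i = phi i64 [ %lo, %entry ], [ %i.next, %loop ]
      %p = getelementptr inbounds i32, i32* %a, i64 %i
      %v = load i32, i32* %p
      %old = atomicrmw add i32* %sum, i32 %v monotonic
      %i.next = add nuw i64 %i, 1
      %more = icmp ult i64 %i.next, %hi
      br i1 %more, label %loop, label %exit
    exit:
      ret void
    }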
> I may be showing my age, but I co-wrote a full HPF-to-MPI
> source-to-source compiler in the late 1990s at Rice University with
> some quite sophisticated optimizations [1,2], and we and others in the
> HPF compiler community — both product and research teams — encountered
> many cases where compiler analyses were inadequate. There is a
> credible view that HPF failed despite very high hopes and significant
> industry and research investment because it relied too much on heroic
> compiler technology.

As a final remark I want to stress that a lot of the problems and complications you mention do not go away with region annotations and/or intrinsics/metadata. We always have to extract high-level knowledge from low-level constructs; the question is only what these constructs will look like. I argue that simple low-level constructs are better suited for the job, as they are easier to understand and to transform with existing passes. Nevertheless, there will always be cases we cannot handle well in IR and which should be optimized earlier, similar to the optimizations functional-language front ends perform before reaching LLVM IR.

Cheers,
  Johannes

> <snip>
--

Johannes Doerfert
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland Informatics Campus, Germany
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert