Andrey Bokhanko
2012-Oct-02 10:09 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
Chris,

> My comment was mostly in response to the Intel proposal, which effectively
> translates OpenMP pragmas directly into llvm intrinsics + metadata. I can't
> imagine a way to make this work *correctly* without massive changes to the
> optimizer.

There are four ways to make this work correctly:

1) Ignore OpenMP-related intrinsics and associated metadata. Least effort, least benefit (no OpenMP support). Yet OpenMP programs compile correctly, as if no pragmas were present -- including *exactly the same* number of routines and the same call graph (thanks to no procedurization in the front-end). The OpenMP specification allows such compilation. This might be the choice for targets that don't support an OpenMP runtime library.

2) Perform procedurization (including insertion of all runtime calls -- no intrinsics are left after this step) at the very start of the LLVM optimizer. No changes to optimizations, but no opportunity to optimize parallel code. As cheap and easy as OpenMP support can be. This might be a good choice for an initial implementation.

3) Do some carefully chosen optimizations before procedurization; do the heavy lifting (like loop restructuring optimizations) after procedurization. Some effort, a lot of benefit. This is essentially what is described in [Tian05] (referenced in our proposal).

4) Make all optimizations thread-aware. The best approach in theory; no existing compiler goes that far.

Our proposal makes all these choices possible. One can implement 1) in half an hour, yet keep the door open for a better solution.

Yours,
Andrey
---
Software Engineer
Intel Compiler Team
Intel Corp.
dag at cray.com
2012-Oct-02 19:47 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
Andrey Bokhanko <andreybokhanko at gmail.com> writes:

> There are four ways to make this work correctly:
>
> 1) Ignore OpenMP-related intrinsics and associated metadata. Least
> effort, least benefit (no OpenMP support). Yet OpenMP programs compile
> correctly, as if no pragmas were present -- including *exactly the
> same* number of routines and the same call graph (thanks to no
> procedurization in the front-end). The OpenMP specification allows
> such compilation. This might be the choice for targets that don't
> support an OpenMP runtime library.

Actually, it is perfectly possible to have a program with OpenMP directives that is NOT valid when those directives are ignored. In other words, it's possible to write a legal OpenMP program that relies on parallelism to function correctly. In practice this doesn't happen in production codes, but it's wrong to say the compiler can just ignore directives with no problems whatsoever.

> 2) Perform procedurization (including insertion of all runtime calls --
> no intrinsics are left after this step) at the very start of the LLVM
> optimizer. No changes to optimizations, but no opportunity to optimize
> parallel code. As cheap and easy as OpenMP support can be. This might
> be a good choice for an initial implementation.

This should work fine, but then why support intrinsics in LLVM at all? I understand you're talking about an initial implementation.

> 3) Do some carefully chosen optimizations before procedurization; do
> the heavy lifting (like loop restructuring optimizations) after
> procedurization. Some effort, a lot of benefit. This is essentially
> what is described in [Tian05] (referenced in our proposal).

What are the important optimizations?

> 4) Make all optimizations thread-aware. The best approach in theory;
> no existing compiler goes that far.

This is probably not practical. It may be fine in academia, but in production environments the resources don't exist, unfortunately.

-David
Chris Lattner
2012-Oct-03 05:26 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
On Oct 2, 2012, at 3:09 AM, Andrey Bokhanko <andreybokhanko at gmail.com> wrote:

> Chris,
>
>> My comment was mostly in response to the Intel proposal, which effectively
>> translates OpenMP pragmas directly into llvm intrinsics + metadata. I can't
>> imagine a way to make this work *correctly* without massive changes to the
>> optimizer.
>
> There are four ways to make this work correctly:
>
> 1) Ignore OpenMP-related intrinsics and associated metadata. Least
> effort, least benefit (no OpenMP support).

This is trivially true, but the entire point of supporting OpenMP in the IR would be to have some sort of late "procedurization" pass that actually exposes the parallelism through some runtime. Saying that we could just ignore this is silly: if we wanted to ignore OpenMP, we could do that in the frontend with far less complexity. In fact, we're already done! ;-)

> 2) Perform procedurization (including insertion of all runtime calls --
> no intrinsics are left after this step) at the very start of the LLVM
> optimizer. No changes to optimizations, but no opportunity to optimize
> parallel code. As cheap and easy as OpenMP support can be. This might
> be a good choice for an initial implementation.
>
> 3) Do some carefully chosen optimizations before procedurization; do
> the heavy lifting (like loop restructuring optimizations) after
> procedurization. Some effort, a lot of benefit. This is essentially
> what is described in [Tian05] (referenced in our proposal).

I think you're missing the point here. The whole idea of LLVM IR is that it doesn't have various "forms" that are valid at different points in the optimizer. Even very late lowering passes like strength reduction are pure IR-to-IR passes that do not introduce special forms. This is in stark contrast to other compilers (e.g. Open64), which have several levels of lowering.

My whole objection comes from the (possibly incorrect, I am not an OpenMP expert!) idea that there are only two reasonable implementation approaches:

1. Early procedurization (e.g. in the frontend that produces LLVM IR). This is very easy to preserve, and correctness is trivial, but you lose some (theoretical?) optimization benefits by doing procedurization early.

2. Late procedurization, where the IR has explicit parallelism constructs and all optimizers preserve its correctness requirements (this is your #4). While this is possible in theory, I'm skeptical that this could make sense, and your proposal certainly isn't the right way to do it.

> 4) Make all optimizations thread-aware. The best approach in theory;
> no existing compiler goes that far.

It's not clear to me exactly what sorts of optimizations late procedurization is attempting to allow. I understand that this is the design that the Intel compiler uses, and you are motivated to make LLVM fit that model. However, the technical benefits of this design are not clear to me, and I also understand that late procedurization has been a continuous source of subtle correctness bugs that are still being found even though the product is mature. This is exactly the sort of thing that I want to avoid in LLVM.

-Chris
Andrey Bokhanko
2012-Oct-03 07:56 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
Chris,

> I think you're missing the point here. The whole idea of LLVM IR is
> that it doesn't have various "forms" that are valid at different
> points in the optimizer. Even very late lowering passes like strength
> reduction are pure IR-to-IR passes that do not introduce special
> forms. This is in stark contrast to other compilers (e.g. Open64),
> which have several levels of lowering.

Well, at some point the compiler *has* to insert runtime library calls. This is true for all proposals, both existing and potential ones. Do you mean that runtime calls must be inserted either strictly before the LLVM optimizer or strictly after it -- no other place? More on this later.

As for treating IR with/without OpenMP intrinsics as separate forms, this is a matter of personal taste and design choice, I guess. Does strength reduction (which replaces multiplications with additions) transform IR into another "form"?

> My whole objection comes from the (possibly incorrect, I am not an
> OpenMP expert!) idea that there are only two reasonable implementation
> approaches:
>
> 1. Early procedurization (e.g. in the frontend that produces LLVM IR).
> This is very easy to preserve, and correctness is trivial, but you lose
> some (theoretical?) optimization benefits by doing procedurization early.
>
> 2. Late procedurization, where the IR has explicit parallelism
> constructs and all optimizers preserve its correctness requirements
> (this is your #4). While this is possible in theory, I'm skeptical
> that this could make sense, and your proposal certainly isn't the
> right way to do it.

I understand your point... and respectfully disagree with it. You basically say that it is all or nothing: either *no* optimizations on parallel code (runtime calls inserted before the LLVM optimizer), or *all* optimizations workable on parallel code (calls inserted after the LLVM optimizer). In the former case we lose *all* optimizations, not just some. As for the latter, I share your skepticism -- and double it.

> I understand that this is the design that the Intel compiler uses, and
> you are motivated to make LLVM fit that model.

Yes and yes. And one more: "the proof is in the pudding", or so they say. The Intel Compiler (which, as you correctly noted, uses essentially the same design) is the metaphorical "pudding" that proves the viability and good performance potential of the approach we proposed.

> I also understand that late procedurization has been a continuous
> source of subtle correctness bugs that are still being found even
> though the product is mature.

Hmmm... One would have to analyze Intel Compiler bug statistics to make this assertion, but this is certainly not my impression.

Yours,
Andrey
---
Software Engineer
Intel Compiler Team
Intel Corp.
Andrey Bokhanko
2012-Oct-03 08:30 UTC
[LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
David,

> Actually, it is perfectly possible to have a program with OpenMP
> directives that is NOT valid when those directives are ignored. In
> other words, it's possible to write a legal OMP program that relies on
> parallelism to function correctly. In practice this doesn't happen in
> production codes but it's wrong to say the compiler can just ignore
> directives with no problems whatsoever.

You might be right. But this is as good as one can do when compiling an OpenMP program for a target with no OpenMP support.

> What are the important optimizations?

You mean, "that should be done before procedurization"? As you understand, there is only one way to know -- try it. As has been mentioned elsewhere, the Intel Compiler employs essentially the same design as we proposed. [Tian05] (use this link to access the paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.3763&rep=rep1&type=pdf) describes the phase ordering that Intel Compiler developers found to provide good performance while preserving correctness.

>> 4) Make all optimizations thread-aware. The best approach in theory;
>> no existing compiler goes that far.
>
> This is probably not practical. It may be fine in academia but in
> production environments the resources don't exist, unfortunately.

I do agree! :-) That's why we propose what we propose -- the design leaves all doors open.

Yours,
Andrey
---
Software Engineer
Intel Compiler Team
Intel Corp.