thr3ads.net - llvm dev - [llvm-dev] [RFC] IR-level Region Annotations [Feb 2017]

If this information is useful, please help other people find it:
Share via:

Mehdi Amini via llvm-dev

2017-Jan-20 18:54 UTC

[llvm-dev] [RFC] IR-level Region Annotations

> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> Sanjoy, the IR would be like something below. It is ok to hoist alloca
instruction outside the region. There are some small changes in optimizer to
understand region-annotation intrinsic.
> 
> { void main() {
>  i32* val = alloca i32
>  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]
> 
>  int i = omp_get_thread_num();
>  compute_something_into_val(val, i);
>  a[i] = val;
> 
>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
> 
> With above representation, we can do privatization and outlining as below
> 
> { void main() {
>  i32* val = alloca i32
>  i32* I = alloca 32
>  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32
4)]
> 
>  %ii = omp_get_thread_num();
>  compute_something_into_val(%val, %i);
>  a[i] = %val;
> 
>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
> 
Here we come to the interesting part: the hoisting of  "i32* I = alloca 32”
above the intrinsics required to update the intrinsics information
“QUAL.PRIVATE”.
This means that the optimizer has to be aware of it, I’m missing the magic here?
I understand that an openmp specific optimization can do it, the question is how
it an openmp agnostic supposed to behave in face of
llvm.experimental.intrinsic_a?

— 
Mehdi


> 1. create i32* priv_val = alloca i32  %priv_i = ...in the region, and
replace all  %val with %prv_val in the region.
> 2. perform outlining. 
> 
> Caller code
> ....
> omp_push_num_threads(4)
> omp_fork_call( .... outline_par_region....) ....
> 
> Callee code: 
> Outlined_par_rgion {
>   I32* priv_val = alloca 32
>   I32* priv_i = ....
> 
>    Ret
> }
> 
> For OpenMP, we do support it at -O0, -O1, -O2 and -O3.  We had to make sure
it runs correctly w/ and w/o optimizations and advanced analysis. So we need to
preserve all source information for BE.
> You can take a look our LLVM-HPC paper for a bit some details.  There are
still tons of work to be done. Thanks.
> 
> Xinmin 
> 
> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
Sanjoy Das via llvm-dev
> Sent: Thursday, January 19, 2017 10:13 PM
> To: Adve, Vikram Sadanand <vadve at illinois.edu>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>; llvm-dev-request at
lists.llvm.org
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
> 
> Hi Vikram,
> 
> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand <vadve at
illinois.edu> wrote:
>> Hi Sanjoy,
>> 
>> Yes, that's exactly what we have been looking at recently here, but
>> the region tags seem to make it possible to express the control flow 
>> as well, so I think we could start with reg ions+metadata, as Hal and
> 
> I'm not yet convinced that region tags are sufficient to model exotic
control flow.
> 
> (I don't know OpenMP so this is a copy-pasted-edited example)
> 
> Say we have:
> 
> void main() {
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val;
>    compute_something_into_val(&val, i);
>    a[i] = val;
>  }
> }
> 
> I presume the (eventual) intended lowering is something like this (if the
intended lowering is different than this, and avoids the issue I'm trying to
highlight then my point is moot):
> 
> void main() {
>  tok = llvm.experimental.intrinsic_a();
> 
>  int i = omp_get_thread_num();
>  i32* val = alloca i32
>  compute_something_into_val(val, i);
>  a[i] = val;
> 
>  llvm.experimental.intrinsic_b(tok);
> }
> 
> However, LLVM is free to hoist the alloca to the entry block:
> 
> void main() {
>  i32* val = alloca i32
>  tok = llvm.experimental.intrinsic_a();
> 
>  int i = omp_get_thread_num();
>  compute_something_into_val(val, i);
>  a[i] = val;
> 
>  llvm.experimental.intrinsic_b(tok);
> }
> 
> and now you have a race between the four parallel forks.
> 
> The problem here is that nothing in the IR expresses that we have four
copies of the region running "at the same time".  In fact, such a
control flow is alien to LLVM today.
> 
> For instance, another evil optimization may turn:
> 
> void main() {
>  int a[4];
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val = compute_something_into_val(i);
>    a[i] = val;
>  }
> 
>  return a[0] + a[1];
> }
> 
> to
> 
> void main() {
>  int a[4];
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val = compute_something_into_val(i);
>    a[i] = val;
>  }
> 
>  return undef;
> }
> 
> since a[i] = val could have initialized at most one element in a.
> 
> Now you could say that the llvm.experimental.intrinsic_a and
llvm.experimental.intrinsic_b intrinsics are magic, and even such
"obvious" optimizations are not allowed to happen across them; but
then calls to these intrinsics is pretty fundamentally different from
"normal" calls, and are probably best modeled as new instructions.
> You're going to have to do the same kind of auditing of passes either
way, and the only extra cost of a new instruction is the extra bitcode reading /
writing code.
> 
> I hope I made sense.
> 
> -- Sanjoy
> 
>> Xinmin proposed, and then figure out what needs to be first class 
>> instructions.
> 
>> 
>> --Vikram Adve
>> 
>> 
>> 
>>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy at
playingwithpointers.com> wrote:
>>> 
>>> Hi,
>>> 
>>> My bias is to use both (b) and (d), since they have complementary 
>>> strengths.  We should use (b) for expressing concepts that
can't be
>>> semantically modeled as a call or invoke (this branch takes both
its
>>> successors), and (d) for expressing things that can be (this call
may
>>> never return), and annotation like things (this region (denoted by 
>>> def-use of a token) is a reduction).
>>> 
>>> I don't grok OpenMP, but perhaps we can come with one or two 
>>> "generalized control flow"-type instructions that can be
used to
>>> model the non-call/invoke like semantics we'd like LLVM to know
>>> about, and model the rest with (d)?
>>> 
>>> -- Sanjoy
>>> 
>>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev 
>>> <llvm-dev at lists.llvm.org> wrote:
>>>> 
>>>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>>>> 
>>>> 
>>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini
<mehdi.amini at apple.com> wrote:
>>>>> 
>>>>> 
>>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev 
>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand
via llvm-dev
>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>> 
>>>>>>> Hi Johannes,
>>>>>>> 
>>>>>>>> I am especially curious where you get your data
from. Tapir [0]
>>>>>>>> (and to some degree PIR [1]) have shown that, 
>>>>>>>> counterintuitively, only a few changes to LLVM
passes are
>>>>>>>> needed. Tapir was recently used in an MIT class
with a lot of
>>>>>>>> students and it seemed to work well with only
minimal changes to
>>>>>>>> analysis and especially transformation passes.
>>>>>>> 
>>>>>>> TAPIR is an elegant, small extension and, in
particular, I think
>>>>>>> the idea of asymmetric parallel tasks and control
flow is a
>>>>>>> clever way to express parallelism with serial
semantics, as in
>>>>>>> Cilk.  Encoding the control flow extensions as
explicit
>>>>>>> instructions is orthogonal to that, though arguably
more elegant than using region tags + metadata.
>>>>>>> 
>>>>>>> However, Cilk is a tiny language compared with the
full
>>>>>>> complexity of other languages, like OpenMP.  To
take just one
>>>>>>> example, TAPIR cannot express the ORDERED construct
of OpenMP.  A
>>>>>>> more serious concern, IMO, is that TAPIR (like
Cilk) requires
>>>>>>> serial semantics, whereas there are many parallel
languages, OpenMP included, that do not obey that restriction.
>>>>>>> Third, OpenMP has *numerous* clauses, e.g.,
REDUCTION or PRIVATE,
>>>>>>> that are needed because without that, you’d be
dependent on
>>>>>>> fundamentally hard compiler analyses to extract the
same
>>>>>>> information for satisfactory parallel performance;
realistic
>>>>>>> applications cannot depend on the success of such
analyses.
>>>>>> 
>>>>>> I agree with this, but I’m also wondering if it needs
to be first
>>>>>> class in the IR?
>>>>>> For example we know our alias analysis is very basic,
and C/C++
>>>>>> have a higher constraint thanks to their type system,
but we
>>>>>> didn’t inject this higher level information that helps
the
>>>>>> optimizer as first class IR constructs.
>>>>> 
>>>>> 
>>>>> FWIW, while i agree with the general point, i wouldn't
use this example.
>>>>> Because we pretty much still suffer to this day because of
it (both
>>>>> in AA, and devirt, and ...)  :) We can't always even
tell fields
>>>>> apart
>>>>> 
>>>>> 
>>>>> Is it inherent to the infrastructure, i.e. using metadata
instead
>>>>> of first class IR construct or is it just a “quality of
implementation” issue?
>>>> 
>>>> Not to derail this conversation:
>>>> 
>>>> IMHO, At some point there is no real difference :)
>>>> 
>>>> Because otherwise, everything is a QOI issue.
>>>> 
>>>> IE if it's super tricky to get metadata that works well and
works
>>>> right, doesn't get lost, etc, and that's inherent to
using metadata,
>>>> that to me is not a QOI issue.
>>>> 
>>>> So could it be done with metadata? Probably?
>>>> But at the same time,  if it had been done with more first
class
>>>> constructs, it would have happened years ago  and been much
lower cost.
>>>> 
>>>> 
>>>> This is what I meant by “inherent to the infrastructure”,
thanks for
>>>> clarifying.
>>>> 
>>>> 
>>>> To clarify, we were proposing metadata that is used as
arguments to
>>>> the region-annotation intrinsics. This metadata has the nice 
>>>> property that it does not get dropped (so it is just being used
as a
>>>> way of encoding whatever data structures are necessary without
predefining a syntactic schema).
>>>> 
>>>> -Hal
>>>> 
>>>> 
>>>> —
>>>> Mehdi
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>> 
>>>> 
>>>> --
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages Leadership 
>>>> Computing Facility Argonne National Laboratory
>>>> 
>>>> 
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Tian, Xinmin via llvm-dev

2017-Jan-20 19:17 UTC

head link

[llvm-dev] [RFC] IR-level Region Annotations

>>>>This means that the optimizer has to be aware of it, I’m missing
the magic here?
This is one option.  

The another option is that, as I mentioned in our LLVM-HPC paper in our
implementation. We have a "prepare phase for pre-privatization" can be
invoked by both Clang FE and Fortran FE right after LLVM IR is generated. So, in
this way, we are able to minimize the optimizations impact for the original val
and i

{ void main() {
  i32* val = alloca i32
  i32* I = alloca 32
  i32* priv_val = alloca i32
  i32* priv_i alloca 32
  tok = llvm.experimental.intrinsic_a()[
"DIR.PARALLEL"(),"QUAL.PRIVATE"(i32* %priv_val, i32
%priv_i), "QUAL.NUM_THREADS"(i32 4)]

  %priv_i = omp_get_thread_num();
  compute_something_into_val(%priv_val, %priv_i);  
   a[priv_i] = %priv_val;

  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
  ....
 
  I = 
  Val    Foo(val, i). 
 }

"Prepare phase" is our way of minimizing the impact to existing
optimizations.
 
Xinmin

-----Original Message-----
From: mehdi.amini at apple.com [mailto:mehdi.amini at apple.com] 
Sent: Friday, January 20, 2017 10:54 AM
To: Tian, Xinmin
Cc: Sanjoy Das; Adve, Vikram Sadanand; llvm-dev at lists.llvm.org;
llvm-dev-request at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> Sanjoy, the IR would be like something below. It is ok to hoist alloca
instruction outside the region. There are some small changes in optimizer to
understand region-annotation intrinsic.
> 
> { void main() {
>  i32* val = alloca i32
>  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), 
> "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]
> 
>  int i = omp_get_thread_num();
>  compute_something_into_val(val, i);
>  a[i] = val;
> 
>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
> 
> With above representation, we can do privatization and outlining as 
> below
> 
> { void main() {
>  i32* val = alloca i32
>  i32* I = alloca 32
>  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), 
> "QUAL.PRIVATE"(i32* %val, i32 %i),
"QUAL.NUM_THREADS"(i32 4)]
> 
>  %ii = omp_get_thread_num();
>  compute_something_into_val(%val, %i);  a[i] = %val;
> 
>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
> 
Here we come to the interesting part: the hoisting of  "i32* I = alloca 32”
above the intrinsics required to update the intrinsics information
“QUAL.PRIVATE”.
This means that the optimizer has to be aware of it, I’m missing the magic here?
I understand that an openmp specific optimization can do it, the question is how
it an openmp agnostic supposed to behave in face of
llvm.experimental.intrinsic_a?

—
Mehdi


> 1. create i32* priv_val = alloca i32  %priv_i = ...in the region, and
replace all  %val with %prv_val in the region.
> 2. perform outlining. 
> 
> Caller code
> ....
> omp_push_num_threads(4)
> omp_fork_call( .... outline_par_region....) ....
> 
> Callee code: 
> Outlined_par_rgion {
>   I32* priv_val = alloca 32
>   I32* priv_i = ....
> 
>    Ret
> }
> 
> For OpenMP, we do support it at -O0, -O1, -O2 and -O3.  We had to make sure
it runs correctly w/ and w/o optimizations and advanced analysis. So we need to
preserve all source information for BE.
> You can take a look our LLVM-HPC paper for a bit some details.  There are
still tons of work to be done. Thanks.
> 
> Xinmin 
> 
> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
Sanjoy Das via llvm-dev
> Sent: Thursday, January 19, 2017 10:13 PM
> To: Adve, Vikram Sadanand <vadve at illinois.edu>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>; llvm-dev-request at
lists.llvm.org
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
> 
> Hi Vikram,
> 
> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand <vadve at
illinois.edu> wrote:
>> Hi Sanjoy,
>> 
>> Yes, that's exactly what we have been looking at recently here, but
>> the region tags seem to make it possible to express the control flow 
>> as well, so I think we could start with reg ions+metadata, as Hal and
> 
> I'm not yet convinced that region tags are sufficient to model exotic
control flow.
> 
> (I don't know OpenMP so this is a copy-pasted-edited example)
> 
> Say we have:
> 
> void main() {
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val;
>    compute_something_into_val(&val, i);
>    a[i] = val;
>  }
> }
> 
> I presume the (eventual) intended lowering is something like this (if the
intended lowering is different than this, and avoids the issue I'm trying to
highlight then my point is moot):
> 
> void main() {
>  tok = llvm.experimental.intrinsic_a();
> 
>  int i = omp_get_thread_num();
>  i32* val = alloca i32
>  compute_something_into_val(val, i);
>  a[i] = val;
> 
>  llvm.experimental.intrinsic_b(tok);
> }
> 
> However, LLVM is free to hoist the alloca to the entry block:
> 
> void main() {
>  i32* val = alloca i32
>  tok = llvm.experimental.intrinsic_a();
> 
>  int i = omp_get_thread_num();
>  compute_something_into_val(val, i);
>  a[i] = val;
> 
>  llvm.experimental.intrinsic_b(tok);
> }
> 
> and now you have a race between the four parallel forks.
> 
> The problem here is that nothing in the IR expresses that we have four
copies of the region running "at the same time".  In fact, such a
control flow is alien to LLVM today.
> 
> For instance, another evil optimization may turn:
> 
> void main() {
>  int a[4];
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val = compute_something_into_val(i);
>    a[i] = val;
>  }
> 
>  return a[0] + a[1];
> }
> 
> to
> 
> void main() {
>  int a[4];
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val = compute_something_into_val(i);
>    a[i] = val;
>  }
> 
>  return undef;
> }
> 
> since a[i] = val could have initialized at most one element in a.
> 
> Now you could say that the llvm.experimental.intrinsic_a and
llvm.experimental.intrinsic_b intrinsics are magic, and even such
"obvious" optimizations are not allowed to happen across them; but
then calls to these intrinsics is pretty fundamentally different from
"normal" calls, and are probably best modeled as new instructions.
> You're going to have to do the same kind of auditing of passes either
way, and the only extra cost of a new instruction is the extra bitcode reading /
writing code.
> 
> I hope I made sense.
> 
> -- Sanjoy
> 
>> Xinmin proposed, and then figure out what needs to be first class 
>> instructions.
> 
>> 
>> --Vikram Adve
>> 
>> 
>> 
>>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy at
playingwithpointers.com> wrote:
>>> 
>>> Hi,
>>> 
>>> My bias is to use both (b) and (d), since they have complementary 
>>> strengths.  We should use (b) for expressing concepts that
can't be
>>> semantically modeled as a call or invoke (this branch takes both
its
>>> successors), and (d) for expressing things that can be (this call
may
>>> never return), and annotation like things (this region (denoted by 
>>> def-use of a token) is a reduction).
>>> 
>>> I don't grok OpenMP, but perhaps we can come with one or two 
>>> "generalized control flow"-type instructions that can be
used to
>>> model the non-call/invoke like semantics we'd like LLVM to know
>>> about, and model the rest with (d)?
>>> 
>>> -- Sanjoy
>>> 
>>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev 
>>> <llvm-dev at lists.llvm.org> wrote:
>>>> 
>>>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>>>> 
>>>> 
>>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini
<mehdi.amini at apple.com> wrote:
>>>>> 
>>>>> 
>>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev 
>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand
via llvm-dev
>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>> 
>>>>>>> Hi Johannes,
>>>>>>> 
>>>>>>>> I am especially curious where you get your data
from. Tapir [0]
>>>>>>>> (and to some degree PIR [1]) have shown that, 
>>>>>>>> counterintuitively, only a few changes to LLVM
passes are
>>>>>>>> needed. Tapir was recently used in an MIT class
with a lot of
>>>>>>>> students and it seemed to work well with only
minimal changes to
>>>>>>>> analysis and especially transformation passes.
>>>>>>> 
>>>>>>> TAPIR is an elegant, small extension and, in
particular, I think
>>>>>>> the idea of asymmetric parallel tasks and control
flow is a
>>>>>>> clever way to express parallelism with serial
semantics, as in
>>>>>>> Cilk.  Encoding the control flow extensions as
explicit
>>>>>>> instructions is orthogonal to that, though arguably
more elegant than using region tags + metadata.
>>>>>>> 
>>>>>>> However, Cilk is a tiny language compared with the
full
>>>>>>> complexity of other languages, like OpenMP.  To
take just one
>>>>>>> example, TAPIR cannot express the ORDERED construct
of OpenMP.  A
>>>>>>> more serious concern, IMO, is that TAPIR (like
Cilk) requires
>>>>>>> serial semantics, whereas there are many parallel
languages, OpenMP included, that do not obey that restriction.
>>>>>>> Third, OpenMP has *numerous* clauses, e.g.,
REDUCTION or PRIVATE,
>>>>>>> that are needed because without that, you’d be
dependent on
>>>>>>> fundamentally hard compiler analyses to extract the
same
>>>>>>> information for satisfactory parallel performance;
realistic
>>>>>>> applications cannot depend on the success of such
analyses.
>>>>>> 
>>>>>> I agree with this, but I’m also wondering if it needs
to be first
>>>>>> class in the IR?
>>>>>> For example we know our alias analysis is very basic,
and C/C++
>>>>>> have a higher constraint thanks to their type system,
but we
>>>>>> didn’t inject this higher level information that helps
the
>>>>>> optimizer as first class IR constructs.
>>>>> 
>>>>> 
>>>>> FWIW, while i agree with the general point, i wouldn't
use this example.
>>>>> Because we pretty much still suffer to this day because of
it (both
>>>>> in AA, and devirt, and ...)  :) We can't always even
tell fields
>>>>> apart
>>>>> 
>>>>> 
>>>>> Is it inherent to the infrastructure, i.e. using metadata
instead
>>>>> of first class IR construct or is it just a “quality of
implementation” issue?
>>>> 
>>>> Not to derail this conversation:
>>>> 
>>>> IMHO, At some point there is no real difference :)
>>>> 
>>>> Because otherwise, everything is a QOI issue.
>>>> 
>>>> IE if it's super tricky to get metadata that works well and
works
>>>> right, doesn't get lost, etc, and that's inherent to
using metadata,
>>>> that to me is not a QOI issue.
>>>> 
>>>> So could it be done with metadata? Probably?
>>>> But at the same time,  if it had been done with more first
class
>>>> constructs, it would have happened years ago  and been much
lower cost.
>>>> 
>>>> 
>>>> This is what I meant by “inherent to the infrastructure”,
thanks for
>>>> clarifying.
>>>> 
>>>> 
>>>> To clarify, we were proposing metadata that is used as
arguments to
>>>> the region-annotation intrinsics. This metadata has the nice 
>>>> property that it does not get dropped (so it is just being used
as a
>>>> way of encoding whatever data structures are necessary without
predefining a syntactic schema).
>>>> 
>>>> -Hal
>>>> 
>>>> 
>>>> —
>>>> Mehdi
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>> 
>>>> 
>>>> --
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages Leadership 
>>>> Computing Facility Argonne National Laboratory
>>>> 
>>>> 
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Mehdi Amini via llvm-dev

2017-Jan-21 21:57 UTC

head link

[llvm-dev] [RFC] IR-level Region Annotations

> On Jan 20, 2017, at 11:17 AM, Tian, Xinmin <xinmin.tian at intel.com>
wrote:
> 
>>>>> This means that the optimizer has to be aware of it, I’m
missing the magic here?
> 
> This is one option.  
> 
> The another option is that, as I mentioned in our LLVM-HPC paper in our
implementation. We have a "prepare phase for pre-privatization" can be
invoked by both Clang FE and Fortran FE right after LLVM IR is generated. So, in
this way, we are able to minimize the optimizations impact for the original val
and I
Ok, but this looks like a “workaround" for your specific use-case, I don’t
see how it can scale as a model-agnostic and general-purpose region semantic.

The fact that you needed this pre-step in the first place seems to indicate to
me that it confirms what multiple people expressed in this thread, for example
what Daniel wrote here:
http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html

— 
Mehdi


> 
> { void main() {
>  i32* val = alloca i32
>  i32* I = alloca 32
>  i32* priv_val = alloca i32
>  i32* priv_i alloca 32
>  tok = llvm.experimental.intrinsic_a()[
"DIR.PARALLEL"(),"QUAL.PRIVATE"(i32* %priv_val, i32
%priv_i), "QUAL.NUM_THREADS"(i32 4)]
> 
>  %priv_i = omp_get_thread_num();
>  compute_something_into_val(%priv_val, %priv_i);  
>   a[priv_i] = %priv_val;
> 
>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>  ....
> 
>  I = 
>  Val >   Foo(val, i). 
> }
> 
> "Prepare phase" is our way of minimizing the impact to existing
optimizations.
> 
> Xinmin
> 
> -----Original Message-----
> From: mehdi.amini at apple.com [mailto:mehdi.amini at apple.com] 
> Sent: Friday, January 20, 2017 10:54 AM
> To: Tian, Xinmin
> Cc: Sanjoy Das; Adve, Vikram Sadanand; llvm-dev at lists.llvm.org;
llvm-dev-request at lists.llvm.org
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
> 
> 
>> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
>> 
>> Sanjoy, the IR would be like something below. It is ok to hoist alloca
instruction outside the region. There are some small changes in optimizer to
understand region-annotation intrinsic.
>> 
>> { void main() {
>> i32* val = alloca i32
>> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), 
>> "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32
4)]
>> 
>> int i = omp_get_thread_num();
>> compute_something_into_val(val, i);
>> a[i] = val;
>> 
>> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>> }
>> 
>> With above representation, we can do privatization and outlining as 
>> below
>> 
>> { void main() {
>> i32* val = alloca i32
>> i32* I = alloca 32
>> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), 
>> "QUAL.PRIVATE"(i32* %val, i32 %i),
"QUAL.NUM_THREADS"(i32 4)]
>> 
>> %ii = omp_get_thread_num();
>> compute_something_into_val(%val, %i);  a[i] = %val;
>> 
>> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>> }
>> 
> 
> Here we come to the interesting part: the hoisting of  "i32* I =
alloca 32” above the intrinsics required to update the intrinsics information
“QUAL.PRIVATE”.
> This means that the optimizer has to be aware of it, I’m missing the magic
here?
> I understand that an openmp specific optimization can do it, the question
is how it an openmp agnostic supposed to behave in face of
llvm.experimental.intrinsic_a?
> 
> —
> Mehdi
> 
> 
> 
>> 1. create i32* priv_val = alloca i32  %priv_i = ...in the region, and
replace all  %val with %prv_val in the region.
>> 2. perform outlining. 
>> 
>> Caller code
>> ....
>> omp_push_num_threads(4)
>> omp_fork_call( .... outline_par_region....) ....
>> 
>> Callee code: 
>> Outlined_par_rgion {
>>  I32* priv_val = alloca 32
>>  I32* priv_i = ....
>> 
>>   Ret
>> }
>> 
>> For OpenMP, we do support it at -O0, -O1, -O2 and -O3.  We had to make
sure it runs correctly w/ and w/o optimizations and advanced analysis. So we
need to preserve all source information for BE.
>> You can take a look our LLVM-HPC paper for a bit some details.  There
are still tons of work to be done. Thanks.
>> 
>> Xinmin 
>> 
>> -----Original Message-----
>> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
Sanjoy Das via llvm-dev
>> Sent: Thursday, January 19, 2017 10:13 PM
>> To: Adve, Vikram Sadanand <vadve at illinois.edu>
>> Cc: llvm-dev <llvm-dev at lists.llvm.org>; llvm-dev-request at
lists.llvm.org
>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>> 
>> Hi Vikram,
>> 
>> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand <vadve at
illinois.edu> wrote:
>>> Hi Sanjoy,
>>> 
>>> Yes, that's exactly what we have been looking at recently here,
but
>>> the region tags seem to make it possible to express the control
flow
>>> as well, so I think we could start with reg ions+metadata, as Hal
and
>> 
>> I'm not yet convinced that region tags are sufficient to model
exotic control flow.
>> 
>> (I don't know OpenMP so this is a copy-pasted-edited example)
>> 
>> Say we have:
>> 
>> void main() {
>> #pragma omp parallel num_threads(4)
>> {
>>   int i = omp_get_thread_num();
>>   int val;
>>   compute_something_into_val(&val, i);
>>   a[i] = val;
>> }
>> }
>> 
>> I presume the (eventual) intended lowering is something like this (if
the intended lowering is different than this, and avoids the issue I'm
trying to highlight then my point is moot):
>> 
>> void main() {
>> tok = llvm.experimental.intrinsic_a();
>> 
>> int i = omp_get_thread_num();
>> i32* val = alloca i32
>> compute_something_into_val(val, i);
>> a[i] = val;
>> 
>> llvm.experimental.intrinsic_b(tok);
>> }
>> 
>> However, LLVM is free to hoist the alloca to the entry block:
>> 
>> void main() {
>> i32* val = alloca i32
>> tok = llvm.experimental.intrinsic_a();
>> 
>> int i = omp_get_thread_num();
>> compute_something_into_val(val, i);
>> a[i] = val;
>> 
>> llvm.experimental.intrinsic_b(tok);
>> }
>> 
>> and now you have a race between the four parallel forks.
>> 
>> The problem here is that nothing in the IR expresses that we have four
copies of the region running "at the same time".  In fact, such a
control flow is alien to LLVM today.
>> 
>> For instance, another evil optimization may turn:
>> 
>> void main() {
>> int a[4];
>> #pragma omp parallel num_threads(4)
>> {
>>   int i = omp_get_thread_num();
>>   int val = compute_something_into_val(i);
>>   a[i] = val;
>> }
>> 
>> return a[0] + a[1];
>> }
>> 
>> to
>> 
>> void main() {
>> int a[4];
>> #pragma omp parallel num_threads(4)
>> {
>>   int i = omp_get_thread_num();
>>   int val = compute_something_into_val(i);
>>   a[i] = val;
>> }
>> 
>> return undef;
>> }
>> 
>> since a[i] = val could have initialized at most one element in a.
>> 
>> Now you could say that the llvm.experimental.intrinsic_a and
llvm.experimental.intrinsic_b intrinsics are magic, and even such
"obvious" optimizations are not allowed to happen across them; but
then calls to these intrinsics is pretty fundamentally different from
"normal" calls, and are probably best modeled as new instructions.
>> You're going to have to do the same kind of auditing of passes
either way, and the only extra cost of a new instruction is the extra bitcode
reading / writing code.
>> 
>> I hope I made sense.
>> 
>> -- Sanjoy
>> 
>>> Xinmin proposed, and then figure out what needs to be first class 
>>> instructions.
>> 
>>> 
>>> --Vikram Adve
>>> 
>>> 
>>> 
>>>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy at
playingwithpointers.com> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> My bias is to use both (b) and (d), since they have
complementary
>>>> strengths.  We should use (b) for expressing concepts that
can't be
>>>> semantically modeled as a call or invoke (this branch takes
both its
>>>> successors), and (d) for expressing things that can be (this
call may
>>>> never return), and annotation like things (this region (denoted
by
>>>> def-use of a token) is a reduction).
>>>> 
>>>> I don't grok OpenMP, but perhaps we can come with one or
two
>>>> "generalized control flow"-type instructions that can
be used to
>>>> model the non-call/invoke like semantics we'd like LLVM to
know
>>>> about, and model the rest with (d)?
>>>> 
>>>> -- Sanjoy
>>>> 
>>>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev 
>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>> 
>>>>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>>>>> 
>>>>> 
>>>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini
<mehdi.amini at apple.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin
at dberlin.org> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via
llvm-dev
>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram
Sadanand via llvm-dev
>>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>>> 
>>>>>>>> Hi Johannes,
>>>>>>>> 
>>>>>>>>> I am especially curious where you get your
data from. Tapir [0]
>>>>>>>>> (and to some degree PIR [1]) have shown
that,
>>>>>>>>> counterintuitively, only a few changes to
LLVM passes are
>>>>>>>>> needed. Tapir was recently used in an MIT
class with a lot of
>>>>>>>>> students and it seemed to work well with
only minimal changes to
>>>>>>>>> analysis and especially transformation
passes.
>>>>>>>> 
>>>>>>>> TAPIR is an elegant, small extension and, in
particular, I think
>>>>>>>> the idea of asymmetric parallel tasks and
control flow is a
>>>>>>>> clever way to express parallelism with serial
semantics, as in
>>>>>>>> Cilk.  Encoding the control flow extensions as
explicit
>>>>>>>> instructions is orthogonal to that, though
arguably more elegant than using region tags + metadata.
>>>>>>>> 
>>>>>>>> However, Cilk is a tiny language compared with
the full
>>>>>>>> complexity of other languages, like OpenMP.  To
take just one
>>>>>>>> example, TAPIR cannot express the ORDERED
construct of OpenMP.  A
>>>>>>>> more serious concern, IMO, is that TAPIR (like
Cilk) requires
>>>>>>>> serial semantics, whereas there are many
parallel languages, OpenMP included, that do not obey that restriction.
>>>>>>>> Third, OpenMP has *numerous* clauses, e.g.,
REDUCTION or PRIVATE,
>>>>>>>> that are needed because without that, you’d be
dependent on
>>>>>>>> fundamentally hard compiler analyses to extract
the same
>>>>>>>> information for satisfactory parallel
performance; realistic
>>>>>>>> applications cannot depend on the success of
such analyses.
>>>>>>> 
>>>>>>> I agree with this, but I’m also wondering if it
needs to be first
>>>>>>> class in the IR?
>>>>>>> For example we know our alias analysis is very
basic, and C/C++
>>>>>>> have a higher constraint thanks to their type
system, but we
>>>>>>> didn’t inject this higher level information that
helps the
>>>>>>> optimizer as first class IR constructs.
>>>>>> 
>>>>>> 
>>>>>> FWIW, while i agree with the general point, i
wouldn't use this example.
>>>>>> Because we pretty much still suffer to this day because
of it (both
>>>>>> in AA, and devirt, and ...)  :) We can't always
even tell fields
>>>>>> apart
>>>>>> 
>>>>>> 
>>>>>> Is it inherent to the infrastructure, i.e. using
metadata instead
>>>>>> of first class IR construct or is it just a “quality of
implementation” issue?
>>>>> 
>>>>> Not to derail this conversation:
>>>>> 
>>>>> IMHO, At some point there is no real difference :)
>>>>> 
>>>>> Because otherwise, everything is a QOI issue.
>>>>> 
>>>>> IE if it's super tricky to get metadata that works well
and works
>>>>> right, doesn't get lost, etc, and that's inherent
to using metadata,
>>>>> that to me is not a QOI issue.
>>>>> 
>>>>> So could it be done with metadata? Probably?
>>>>> But at the same time,  if it had been done with more first
class
>>>>> constructs, it would have happened years ago  and been much
lower cost.
>>>>> 
>>>>> 
>>>>> This is what I meant by “inherent to the infrastructure”,
thanks for
>>>>> clarifying.
>>>>> 
>>>>> 
>>>>> To clarify, we were proposing metadata that is used as
arguments to
>>>>> the region-annotation intrinsics. This metadata has the
nice
>>>>> property that it does not get dropped (so it is just being
used as a
>>>>> way of encoding whatever data structures are necessary
without predefining a syntactic schema).
>>>>> 
>>>>> -Hal
>>>>> 
>>>>> 
>>>>> —
>>>>> Mehdi
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>> 
>>>>> 
>>>>> --
>>>>> Hal Finkel
>>>>> Lead, Compiler Technology and Programming Languages
Leadership
>>>>> Computing Facility Argonne National Laboratory
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

Mehdi Amini via llvm-dev

2017-Feb-01 03:30 UTC

head link

[llvm-dev] [RFC] IR-level Region Annotations

Sent from my iPhone
> On Jan 31, 2017, at 7:27 PM, Tian, Xinmin <xinmin.tian at intel.com>
wrote:
> 
> Remember that,  the prepare-phase is invoked in the FE or right after FE,
inlining is not happening, that is why we don't call it "pass". 
Chandler made a good point for this case a long time back.
> 
What I was describing is the inlining in the optimizer pipeline.


> Hoisting alloca is totally ok.  A new alloca is generated during outlining
later on for anything marked as "private" (so long the
"private" information is saves in the tag). I thought we talked this
in an early email.

Can you describe how (and at which point) you get the private for
"var"  added to the tag?


-- 
Mehdi
> 
> By the way, all concerns you have are all valid, we had worked on resolving
these issues 10+ years back when we did similar things in our compilers. I
wouldn't claim we have perfect solutions, but we do reasonable good
solutions for handling general directives and openmp directives.
> 
> Xinmin
> 
> -----Original Message-----
> From: mehdi.amini at apple.com [mailto:mehdi.amini at apple.com] 
> Sent: Tuesday, January 31, 2017 7:08 PM
> To: Tian, Xinmin <xinmin.tian at intel.com>
> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Adve, Vikram
Sadanand <vadve at illinois.edu>; llvm-dev at lists.llvm.org;
llvm-dev-request at lists.llvm.org
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
> 
> 
>> On Jan 31, 2017, at 6:48 PM, Tian, Xinmin <xinmin.tian at
intel.com> wrote:
>> 
>> Let me try this. 
>> 
>> You can simply consider the prepare-phase (e.g. pre-privatization) 
were done in FE  (actually a library can be used by multiple FEs at LLVM IR
level),  the region is run with 1 thread,  region annotation (scope,
single-entry-single-exit) as memory barrier conservatively for now (instead of
checking individual memory dependency, aliasing via tags which is the actual
implementation is done) marked with region intrinsic functions. What
optimization will mess up  with this region-annotation?
> 
> The first thing that comes to my mind is inlining that can put the IR in a
form that breaks the invariant you tried to enforce with your
"prepare-phase” (for example by hoisting an allocas).
> 
> — 
> Mehdi
> 
> 
>> 
>> 
>> -----Original Message-----
>> From: mehdi.amini at apple.com [mailto:mehdi.amini at apple.com] 
>> Sent: Tuesday, January 31, 2017 5:47 PM
>> To: Tian, Xinmin <xinmin.tian at intel.com>
>> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Adve, Vikram
Sadanand <vadve at illinois.edu>; llvm-dev at lists.llvm.org;
llvm-dev-request at lists.llvm.org
>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>> 
>> 
>>> On Jan 31, 2017, at 5:38 PM, Tian, Xinmin <xinmin.tian at
intel.com> wrote:
>>> 
>>>>>>> Ok, but this looks like a “workaround" for
your specific use-case, I don’t see how it can scale as a model-agnostic and
general-purpose region semantic.
>>> 
>>> I would say it is a design trade-off.
>> 
>> I’m not sure if we’re talking about the same thing here: my
understanding at this point is that the design trading-off you’re making
“simplicity” by scarifying “correctness”.
>> 
>> Requiring the IR to stay in what you’re calling a “canonical” form in
your answer to Sanjoy in order to not miscompile a program is not an approach
that seems compatible with how we deal with the IR usually.
>> 
>>> Regardless it is a new instruction or an intrinsics with token/tag,
it will consist of  model-agnostic part and model-non-agnostic part. The package
comes with a framework for parsing and using these intrinsics.  See the reply I
had for Sanjoy's email.
>> 
>> The answer to Sanjoy is not really helpful to clarify anything to me.
At this point I still don’t understand how this is supposed to be correct in
general.
>> 
>> It would be helpful to have a LangRef patch that describes the semantic
associated to your region intrinsics. Then we may be able to process some
examples through the formalized description.
>> 
>> — 
>> Mehdi
>> 
>> 
>>> 
>>> -----Original Message-----
>>> From: mehdi.amini at apple.com [mailto:mehdi.amini at apple.com] 
>>> Sent: Saturday, January 21, 2017 1:57 PM
>>> To: Tian, Xinmin <xinmin.tian at intel.com>
>>> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Adve,
Vikram Sadanand <vadve at illinois.edu>; llvm-dev at lists.llvm.org;
llvm-dev-request at lists.llvm.org
>>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>> 
>>> 
>>>> On Jan 20, 2017, at 11:17 AM, Tian, Xinmin <xinmin.tian at
intel.com> wrote:
>>>> 
>>>>>>>> This means that the optimizer has to be aware
of it, I’m missing the magic here?
>>>> 
>>>> This is one option.  
>>>> 
>>>> The another option is that, as I mentioned in our LLVM-HPC
paper in our implementation. We have a "prepare phase for
pre-privatization" can be invoked by both Clang FE and Fortran FE right
after LLVM IR is generated. So, in this way, we are able to minimize the
optimizations impact for the original val and I
>>> 
>>> Ok, but this looks like a “workaround" for your specific
use-case, I don’t see how it can scale as a model-agnostic and general-purpose
region semantic.
>>> 
>>> The fact that you needed this pre-step in the first place seems to
indicate to me that it confirms what multiple people expressed in this thread,
for example what Daniel wrote here:
http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html
>>> 
>>> — 
>>> Mehdi
>>> 
>>> 
>>> 
>>>> 
>>>> { void main() {
>>>> i32* val = alloca i32
>>>> i32* I = alloca 32
>>>> i32* priv_val = alloca i32
>>>> i32* priv_i alloca 32
>>>> tok = llvm.experimental.intrinsic_a()[
"DIR.PARALLEL"(),"QUAL.PRIVATE"(i32* %priv_val, i32
%priv_i), "QUAL.NUM_THREADS"(i32 4)]
>>>> 
>>>> %priv_i = omp_get_thread_num();
>>>> compute_something_into_val(%priv_val, %priv_i);  
>>>> a[priv_i] = %priv_val;
>>>> 
>>>>
llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>>>> ....
>>>> 
>>>> I = 
>>>> Val >>>> Foo(val, i). 
>>>> }
>>>> 
>>>> "Prepare phase" is our way of minimizing the impact
to existing optimizations.
>>>> 
>>>> Xinmin
>>>> 
>>>> -----Original Message-----
>>>> From: mehdi.amini at apple.com [mailto:mehdi.amini at
apple.com]
>>>> Sent: Friday, January 20, 2017 10:54 AM
>>>> To: Tian, Xinmin
>>>> Cc: Sanjoy Das; Adve, Vikram Sadanand; llvm-dev at
lists.llvm.org; llvm-dev-request at lists.llvm.org
>>>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>>> 
>>>> 
>>>>> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>>>>> 
>>>>> Sanjoy, the IR would be like something below. It is ok to
hoist alloca instruction outside the region. There are some small changes in
optimizer to understand region-annotation intrinsic.
>>>>> 
>>>>> { void main() {
>>>>> i32* val = alloca i32
>>>>> tok = llvm.experimental.intrinsic_a()[
"DIR.PARALLEL"(),
>>>>> "QUAL.PRIVATE"(i32* val),
"QUAL.NUM_THREADS"(i32 4)]
>>>>> 
>>>>> int i = omp_get_thread_num();
>>>>> compute_something_into_val(val, i);
>>>>> a[i] = val;
>>>>> 
>>>>>
llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>>>>> }
>>>>> 
>>>>> With above representation, we can do privatization and
outlining as
>>>>> below
>>>>> 
>>>>> { void main() {
>>>>> i32* val = alloca i32
>>>>> i32* I = alloca 32
>>>>> tok = llvm.experimental.intrinsic_a()[
"DIR.PARALLEL"(),
>>>>> "QUAL.PRIVATE"(i32* %val, i32 %i),
"QUAL.NUM_THREADS"(i32 4)]
>>>>> 
>>>>> %ii = omp_get_thread_num();
>>>>> compute_something_into_val(%val, %i);  a[i] = %val;
>>>>> 
>>>>>
llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>>>>> }
>>>>> 
>>>> 
>>>> Here we come to the interesting part: the hoisting of 
"i32* I = alloca 32” above the intrinsics required to update the intrinsics
information “QUAL.PRIVATE”.
>>>> This means that the optimizer has to be aware of it, I’m
missing the magic here?
>>>> I understand that an openmp specific optimization can do it,
the question is how it an openmp agnostic supposed to behave in face of
llvm.experimental.intrinsic_a?
>>>> 
>>>> —
>>>> Mehdi
>>>> 
>>>> 
>>>> 
>>>>> 1. create i32* priv_val = alloca i32  %priv_i = ...in the
region, and replace all  %val with %prv_val in the region.
>>>>> 2. perform outlining. 
>>>>> 
>>>>> Caller code
>>>>> ....
>>>>> omp_push_num_threads(4)
>>>>> omp_fork_call( .... outline_par_region....) ....
>>>>> 
>>>>> Callee code: 
>>>>> Outlined_par_rgion {
>>>>> I32* priv_val = alloca 32
>>>>> I32* priv_i = ....
>>>>> 
>>>>> Ret
>>>>> }
>>>>> 
>>>>> For OpenMP, we do support it at -O0, -O1, -O2 and -O3.  We
had to make sure it runs correctly w/ and w/o optimizations and advanced
analysis. So we need to preserve all source information for BE.
>>>>> You can take a look our LLVM-HPC paper for a bit some
details.  There are still tons of work to be done. Thanks.
>>>>> 
>>>>> Xinmin 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org]
On Behalf Of Sanjoy Das via llvm-dev
>>>>> Sent: Thursday, January 19, 2017 10:13 PM
>>>>> To: Adve, Vikram Sadanand <vadve at illinois.edu>
>>>>> Cc: llvm-dev <llvm-dev at lists.llvm.org>;
llvm-dev-request at lists.llvm.org
>>>>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>>>> 
>>>>> Hi Vikram,
>>>>> 
>>>>>> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand
<vadve at illinois.edu> wrote:
>>>>>> Hi Sanjoy,
>>>>>> 
>>>>>> Yes, that's exactly what we have been looking at
recently here, but
>>>>>> the region tags seem to make it possible to express the
control flow
>>>>>> as well, so I think we could start with reg
ions+metadata, as Hal and
>>>>> 
>>>>> I'm not yet convinced that region tags are sufficient
to model exotic control flow.
>>>>> 
>>>>> (I don't know OpenMP so this is a copy-pasted-edited
example)
>>>>> 
>>>>> Say we have:
>>>>> 
>>>>> void main() {
>>>>> #pragma omp parallel num_threads(4)
>>>>> {
>>>>> int i = omp_get_thread_num();
>>>>> int val;
>>>>> compute_something_into_val(&val, i);
>>>>> a[i] = val;
>>>>> }
>>>>> }
>>>>> 
>>>>> I presume the (eventual) intended lowering is something
like this (if the intended lowering is different than this, and avoids the issue
I'm trying to highlight then my point is moot):
>>>>> 
>>>>> void main() {
>>>>> tok = llvm.experimental.intrinsic_a();
>>>>> 
>>>>> int i = omp_get_thread_num();
>>>>> i32* val = alloca i32
>>>>> compute_something_into_val(val, i);
>>>>> a[i] = val;
>>>>> 
>>>>> llvm.experimental.intrinsic_b(tok);
>>>>> }
>>>>> 
>>>>> However, LLVM is free to hoist the alloca to the entry
block:
>>>>> 
>>>>> void main() {
>>>>> i32* val = alloca i32
>>>>> tok = llvm.experimental.intrinsic_a();
>>>>> 
>>>>> int i = omp_get_thread_num();
>>>>> compute_something_into_val(val, i);
>>>>> a[i] = val;
>>>>> 
>>>>> llvm.experimental.intrinsic_b(tok);
>>>>> }
>>>>> 
>>>>> and now you have a race between the four parallel forks.
>>>>> 
>>>>> The problem here is that nothing in the IR expresses that
we have four copies of the region running "at the same time".  In
fact, such a control flow is alien to LLVM today.
>>>>> 
>>>>> For instance, another evil optimization may turn:
>>>>> 
>>>>> void main() {
>>>>> int a[4];
>>>>> #pragma omp parallel num_threads(4)
>>>>> {
>>>>> int i = omp_get_thread_num();
>>>>> int val = compute_something_into_val(i);
>>>>> a[i] = val;
>>>>> }
>>>>> 
>>>>> return a[0] + a[1];
>>>>> }
>>>>> 
>>>>> to
>>>>> 
>>>>> void main() {
>>>>> int a[4];
>>>>> #pragma omp parallel num_threads(4)
>>>>> {
>>>>> int i = omp_get_thread_num();
>>>>> int val = compute_something_into_val(i);
>>>>> a[i] = val;
>>>>> }
>>>>> 
>>>>> return undef;
>>>>> }
>>>>> 
>>>>> since a[i] = val could have initialized at most one element
in a.
>>>>> 
>>>>> Now you could say that the llvm.experimental.intrinsic_a
and llvm.experimental.intrinsic_b intrinsics are magic, and even such
"obvious" optimizations are not allowed to happen across them; but
then calls to these intrinsics is pretty fundamentally different from
"normal" calls, and are probably best modeled as new instructions.
>>>>> You're going to have to do the same kind of auditing of
passes either way, and the only extra cost of a new instruction is the extra
bitcode reading / writing code.
>>>>> 
>>>>> I hope I made sense.
>>>>> 
>>>>> -- Sanjoy
>>>>> 
>>>>>> Xinmin proposed, and then figure out what needs to be
first class
>>>>>> instructions.
>>>>> 
>>>>>> 
>>>>>> --Vikram Adve
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy
at playingwithpointers.com> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> My bias is to use both (b) and (d), since they have
complementary
>>>>>>> strengths.  We should use (b) for expressing
concepts that can't be
>>>>>>> semantically modeled as a call or invoke (this
branch takes both its
>>>>>>> successors), and (d) for expressing things that can
be (this call may
>>>>>>> never return), and annotation like things (this
region (denoted by
>>>>>>> def-use of a token) is a reduction).
>>>>>>> 
>>>>>>> I don't grok OpenMP, but perhaps we can come
with one or two
>>>>>>> "generalized control flow"-type
instructions that can be used to
>>>>>>> model the non-call/invoke like semantics we'd
like LLVM to know
>>>>>>> about, and model the rest with (d)?
>>>>>>> 
>>>>>>> -- Sanjoy
>>>>>>> 
>>>>>>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via
llvm-dev
>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>>> 
>>>>>>>> On 01/19/2017 03:36 PM, Mehdi Amini via
llvm-dev wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin
<dberlin at dberlin.org> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi
Amini <mehdi.amini at apple.com> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin
<dberlin at dberlin.org> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi
Amini via llvm-dev
>>>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Jan 19, 2017, at 11:36 AM, Adve,
Vikram Sadanand via llvm-dev
>>>>>>>>>>> <llvm-dev at lists.llvm.org>
wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Johannes,
>>>>>>>>>>> 
>>>>>>>>>>>> I am especially curious where
you get your data from. Tapir [0]
>>>>>>>>>>>> (and to some degree PIR [1])
have shown that,
>>>>>>>>>>>> counterintuitively, only a few
changes to LLVM passes are
>>>>>>>>>>>> needed. Tapir was recently used
in an MIT class with a lot of
>>>>>>>>>>>> students and it seemed to work
well with only minimal changes to
>>>>>>>>>>>> analysis and especially
transformation passes.
>>>>>>>>>>> 
>>>>>>>>>>> TAPIR is an elegant, small
extension and, in particular, I think
>>>>>>>>>>> the idea of asymmetric parallel
tasks and control flow is a
>>>>>>>>>>> clever way to express parallelism
with serial semantics, as in
>>>>>>>>>>> Cilk.  Encoding the control flow
extensions as explicit
>>>>>>>>>>> instructions is orthogonal to that,
though arguably more elegant than using region tags + metadata.
>>>>>>>>>>> 
>>>>>>>>>>> However, Cilk is a tiny language
compared with the full
>>>>>>>>>>> complexity of other languages, like
OpenMP.  To take just one
>>>>>>>>>>> example, TAPIR cannot express the
ORDERED construct of OpenMP.  A
>>>>>>>>>>> more serious concern, IMO, is that
TAPIR (like Cilk) requires
>>>>>>>>>>> serial semantics, whereas there are
many parallel languages, OpenMP included, that do not obey that restriction.
>>>>>>>>>>> Third, OpenMP has *numerous*
clauses, e.g., REDUCTION or PRIVATE,
>>>>>>>>>>> that are needed because without
that, you’d be dependent on
>>>>>>>>>>> fundamentally hard compiler
analyses to extract the same
>>>>>>>>>>> information for satisfactory
parallel performance; realistic
>>>>>>>>>>> applications cannot depend on the
success of such analyses.
>>>>>>>>>> 
>>>>>>>>>> I agree with this, but I’m also
wondering if it needs to be first
>>>>>>>>>> class in the IR?
>>>>>>>>>> For example we know our alias analysis
is very basic, and C/C++
>>>>>>>>>> have a higher constraint thanks to
their type system, but we
>>>>>>>>>> didn’t inject this higher level
information that helps the
>>>>>>>>>> optimizer as first class IR constructs.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> FWIW, while i agree with the general point,
i wouldn't use this example.
>>>>>>>>> Because we pretty much still suffer to this
day because of it (both
>>>>>>>>> in AA, and devirt, and ...)  :) We
can't always even tell fields
>>>>>>>>> apart
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Is it inherent to the infrastructure, i.e.
using metadata instead
>>>>>>>>> of first class IR construct or is it just a
“quality of implementation” issue?
>>>>>>>> 
>>>>>>>> Not to derail this conversation:
>>>>>>>> 
>>>>>>>> IMHO, At some point there is no real difference
:)
>>>>>>>> 
>>>>>>>> Because otherwise, everything is a QOI issue.
>>>>>>>> 
>>>>>>>> IE if it's super tricky to get metadata
that works well and works
>>>>>>>> right, doesn't get lost, etc, and
that's inherent to using metadata,
>>>>>>>> that to me is not a QOI issue.
>>>>>>>> 
>>>>>>>> So could it be done with metadata? Probably?
>>>>>>>> But at the same time,  if it had been done with
more first class
>>>>>>>> constructs, it would have happened years ago 
and been much lower cost.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> This is what I meant by “inherent to the
infrastructure”, thanks for
>>>>>>>> clarifying.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> To clarify, we were proposing metadata that is
used as arguments to
>>>>>>>> the region-annotation intrinsics. This metadata
has the nice
>>>>>>>> property that it does not get dropped (so it is
just being used as a
>>>>>>>> way of encoding whatever data structures are
necessary without predefining a syntactic schema).
>>>>>>>> 
>>>>>>>> -Hal
>>>>>>>> 
>>>>>>>> 
>>>>>>>> —
>>>>>>>> Mehdi
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> LLVM Developers mailing list
>>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Hal Finkel
>>>>>>>> Lead, Compiler Technology and Programming
Languages Leadership
>>>>>>>> Computing Facility Argonne National Laboratory
>>>>>>>> 
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> LLVM Developers mailing list
>>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>> 
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>> 
>>> 
>> 
> -------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170131/f6820515/attachment-0001.html>

Tian, Xinmin via llvm-dev

2017-Feb-01 03:53 UTC

head link

[llvm-dev] [RFC] IR-level Region Annotations

In this case, inliner is educated to add all local variables to the tag of
enclosing parallel region, if there is enclosing parallel region.

In our icc implementation, it is even simple, as we have routine level symbol
table, the inliner adds  ”private”  attribute to those local variables w/o
checking enclosing scope, the parallelizer does check and use it.

Xinmin

From: mehdi.amini at apple.com [mailto:mehdi.amini at apple.com]
Sent: Tuesday, January 31, 2017 7:31 PM
To: Tian, Xinmin <xinmin.tian at intel.com>
Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Adve, Vikram Sadanand
<vadve at illinois.edu>; llvm-dev at lists.llvm.org; llvm-dev-request at
lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations



Sent from my iPhone

On Jan 31, 2017, at 7:27 PM, Tian, Xinmin <xinmin.tian at
intel.com<mailto:xinmin.tian at intel.com>> wrote:
Remember that,  the prepare-phase is invoked in the FE or right after FE,
inlining is not happening, that is why we don't call it "pass". 
Chandler made a good point for this case a long time back.


What I was describing is the inlining in the optimizer pipeline.



Hoisting alloca is totally ok.  A new alloca is generated during outlining later
on for anything marked as "private" (so long the "private"
information is saves in the tag). I thought we talked this in an early email.


Can you describe how (and at which point) you get the private for
"var"  added to the tag?


--
Mehdi


By the way, all concerns you have are all valid, we had worked on resolving
these issues 10+ years back when we did similar things in our compilers. I
wouldn't claim we have perfect solutions, but we do reasonable good
solutions for handling general directives and openmp directives.

Xinmin

-----Original Message-----
From: mehdi.amini at apple.com<mailto:mehdi.amini at apple.com>
[mailto:mehdi.amini at apple.com]
Sent: Tuesday, January 31, 2017 7:08 PM
To: Tian, Xinmin <xinmin.tian at intel.com<mailto:xinmin.tian at
intel.com>>
Cc: Sanjoy Das <sanjoy at playingwithpointers.com<mailto:sanjoy at
playingwithpointers.com>>; Adve, Vikram Sadanand <vadve at
illinois.edu<mailto:vadve at illinois.edu>>; llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>; llvm-dev-request at
lists.llvm.org<mailto:llvm-dev-request at lists.llvm.org>
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations



On Jan 31, 2017, at 6:48 PM, Tian, Xinmin <xinmin.tian at
intel.com<mailto:xinmin.tian at intel.com>> wrote:

Let me try this.

You can simply consider the prepare-phase (e.g. pre-privatization)  were done in
FE  (actually a library can be used by multiple FEs at LLVM IR level),  the
region is run with 1 thread,  region annotation (scope,
single-entry-single-exit) as memory barrier conservatively for now (instead of
checking individual memory dependency, aliasing via tags which is the actual
implementation is done) marked with region intrinsic functions. What
optimization will mess up  with this region-annotation?

The first thing that comes to my mind is inlining that can put the IR in a form
that breaks the invariant you tried to enforce with your "prepare-phase”
(for example by hoisting an allocas).

—
Mehdi





-----Original Message-----
From: mehdi.amini at apple.com<mailto:mehdi.amini at apple.com>
[mailto:mehdi.amini at apple.com]
Sent: Tuesday, January 31, 2017 5:47 PM
To: Tian, Xinmin <xinmin.tian at intel.com<mailto:xinmin.tian at
intel.com>>
Cc: Sanjoy Das <sanjoy at playingwithpointers.com<mailto:sanjoy at
playingwithpointers.com>>; Adve, Vikram Sadanand <vadve at
illinois.edu<mailto:vadve at illinois.edu>>; llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>; llvm-dev-request at
lists.llvm.org<mailto:llvm-dev-request at lists.llvm.org>
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations


On Jan 31, 2017, at 5:38 PM, Tian, Xinmin <xinmin.tian at
intel.com<mailto:xinmin.tian at intel.com>> wrote:

Ok, but this looks like a “workaround" for your specific use-case, I don’t
see how it can scale as a model-agnostic and general-purpose region semantic.

I would say it is a design trade-off.

I’m not sure if we’re talking about the same thing here: my understanding at
this point is that the design trading-off you’re making “simplicity” by
scarifying “correctness”.

Requiring the IR to stay in what you’re calling a “canonical” form in your
answer to Sanjoy in order to not miscompile a program is not an approach that
seems compatible with how we deal with the IR usually.

Regardless it is a new instruction or an intrinsics with token/tag, it will
consist of  model-agnostic part and model-non-agnostic part. The package comes
with a framework for parsing and using these intrinsics.  See the reply I had
for Sanjoy's email.

The answer to Sanjoy is not really helpful to clarify anything to me. At this
point I still don’t understand how this is supposed to be correct in general.

It would be helpful to have a LangRef patch that describes the semantic
associated to your region intrinsics. Then we may be able to process some
examples through the formalized description.

—
Mehdi



-----Original Message-----
From: mehdi.amini at apple.com<mailto:mehdi.amini at apple.com>
[mailto:mehdi.amini at apple.com]
Sent: Saturday, January 21, 2017 1:57 PM
To: Tian, Xinmin <xinmin.tian at intel.com<mailto:xinmin.tian at
intel.com>>
Cc: Sanjoy Das <sanjoy at playingwithpointers.com<mailto:sanjoy at
playingwithpointers.com>>; Adve, Vikram Sadanand <vadve at
illinois.edu<mailto:vadve at illinois.edu>>; llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>; llvm-dev-request at
lists.llvm.org<mailto:llvm-dev-request at lists.llvm.org>
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations


On Jan 20, 2017, at 11:17 AM, Tian, Xinmin <xinmin.tian at
intel.com<mailto:xinmin.tian at intel.com>> wrote:

This means that the optimizer has to be aware of it, I’m missing the magic here?

This is one option.

The another option is that, as I mentioned in our LLVM-HPC paper in our
implementation. We have a "prepare phase for pre-privatization" can be
invoked by both Clang FE and Fortran FE right after LLVM IR is generated. So, in
this way, we are able to minimize the optimizations impact for the original val
and I

Ok, but this looks like a “workaround" for your specific use-case, I don’t
see how it can scale as a model-agnostic and general-purpose region semantic.

The fact that you needed this pre-step in the first place seems to indicate to
me that it confirms what multiple people expressed in this thread, for example
what Daniel wrote here:
http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html

—
Mehdi




{ void main() {
i32* val = alloca i32
i32* I = alloca 32
i32* priv_val = alloca i32
i32* priv_i alloca 32
tok = llvm.experimental.intrinsic_a()[
"DIR.PARALLEL"(),"QUAL.PRIVATE"(i32* %priv_val, i32
%priv_i), "QUAL.NUM_THREADS"(i32 4)]

%priv_i = omp_get_thread_num();
compute_something_into_val(%priv_val, %priv_i);
a[priv_i] = %priv_val;

llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
....

I Val Foo(val, i).
}

"Prepare phase" is our way of minimizing the impact to existing
optimizations.

Xinmin

-----Original Message-----
From: mehdi.amini at apple.com<mailto:mehdi.amini at apple.com>
[mailto:mehdi.amini at apple.com]
Sent: Friday, January 20, 2017 10:54 AM
To: Tian, Xinmin
Cc: Sanjoy Das; Adve, Vikram Sadanand; llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>; llvm-dev-request at
lists.llvm.org<mailto:llvm-dev-request at lists.llvm.org>
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations


On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:

Sanjoy, the IR would be like something below. It is ok to hoist alloca
instruction outside the region. There are some small changes in optimizer to
understand region-annotation intrinsic.

{ void main() {
i32* val = alloca i32
tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]

int i = omp_get_thread_num();
compute_something_into_val(val, i);
a[i] = val;

llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
}

With above representation, we can do privatization and outlining as
below

{ void main() {
i32* val = alloca i32
i32* I = alloca 32
tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32
4)]

%ii = omp_get_thread_num();
compute_something_into_val(%val, %i);  a[i] = %val;

llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
}


Here we come to the interesting part: the hoisting of  "i32* I = alloca 32”
above the intrinsics required to update the intrinsics information
“QUAL.PRIVATE”.
This means that the optimizer has to be aware of it, I’m missing the magic here?
I understand that an openmp specific optimization can do it, the question is how
it an openmp agnostic supposed to behave in face of
llvm.experimental.intrinsic_a?

—
Mehdi



1. create i32* priv_val = alloca i32  %priv_i = ...in the region, and replace
all  %val with %prv_val in the region.
2. perform outlining.

Caller code
....
omp_push_num_threads(4)
omp_fork_call( .... outline_par_region....) ....

Callee code:
Outlined_par_rgion {
I32* priv_val = alloca 32
I32* priv_i = ....

Ret
}

For OpenMP, we do support it at -O0, -O1, -O2 and -O3.  We had to make sure it
runs correctly w/ and w/o optimizations and advanced analysis. So we need to
preserve all source information for BE.
You can take a look our LLVM-HPC paper for a bit some details.  There are still
tons of work to be done. Thanks.

Xinmin

-----Original Message-----
From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Sanjoy
Das via llvm-dev
Sent: Thursday, January 19, 2017 10:13 PM
To: Adve, Vikram Sadanand <vadve at illinois.edu<mailto:vadve at
illinois.edu>>
Cc: llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>>; llvm-dev-request at
lists.llvm.org<mailto:llvm-dev-request at lists.llvm.org>
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

Hi Vikram,

On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand <vadve at
illinois.edu<mailto:vadve at illinois.edu>> wrote:
Hi Sanjoy,

Yes, that's exactly what we have been looking at recently here, but
the region tags seem to make it possible to express the control flow
as well, so I think we could start with reg ions+metadata, as Hal and

I'm not yet convinced that region tags are sufficient to model exotic
control flow.

(I don't know OpenMP so this is a copy-pasted-edited example)

Say we have:

void main() {
#pragma omp parallel num_threads(4)
{
int i = omp_get_thread_num();
int val;
compute_something_into_val(&val, i);
a[i] = val;
}
}

I presume the (eventual) intended lowering is something like this (if the
intended lowering is different than this, and avoids the issue I'm trying to
highlight then my point is moot):

void main() {
tok = llvm.experimental.intrinsic_a();

int i = omp_get_thread_num();
i32* val = alloca i32
compute_something_into_val(val, i);
a[i] = val;

llvm.experimental.intrinsic_b(tok);
}

However, LLVM is free to hoist the alloca to the entry block:

void main() {
i32* val = alloca i32
tok = llvm.experimental.intrinsic_a();

int i = omp_get_thread_num();
compute_something_into_val(val, i);
a[i] = val;

llvm.experimental.intrinsic_b(tok);
}

and now you have a race between the four parallel forks.

The problem here is that nothing in the IR expresses that we have four copies of
the region running "at the same time".  In fact, such a control flow
is alien to LLVM today.

For instance, another evil optimization may turn:

void main() {
int a[4];
#pragma omp parallel num_threads(4)
{
int i = omp_get_thread_num();
int val = compute_something_into_val(i);
a[i] = val;
}

return a[0] + a[1];
}

to

void main() {
int a[4];
#pragma omp parallel num_threads(4)
{
int i = omp_get_thread_num();
int val = compute_something_into_val(i);
a[i] = val;
}

return undef;
}

since a[i] = val could have initialized at most one element in a.

Now you could say that the llvm.experimental.intrinsic_a and
llvm.experimental.intrinsic_b intrinsics are magic, and even such
"obvious" optimizations are not allowed to happen across them; but
then calls to these intrinsics is pretty fundamentally different from
"normal" calls, and are probably best modeled as new instructions.
You're going to have to do the same kind of auditing of passes either way,
and the only extra cost of a new instruction is the extra bitcode reading /
writing code.

I hope I made sense.

-- Sanjoy

Xinmin proposed, and then figure out what needs to be first class
instructions.


--Vikram Adve



On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy at
playingwithpointers.com<mailto:sanjoy at playingwithpointers.com>>
wrote:

Hi,

My bias is to use both (b) and (d), since they have complementary
strengths.  We should use (b) for expressing concepts that can't be
semantically modeled as a call or invoke (this branch takes both its
successors), and (d) for expressing things that can be (this call may
never return), and annotation like things (this region (denoted by
def-use of a token) is a reduction).

I don't grok OpenMP, but perhaps we can come with one or two
"generalized control flow"-type instructions that can be used to
model the non-call/invoke like semantics we'd like LLVM to know
about, and model the rest with (d)?

-- Sanjoy

On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev
<llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>>
wrote:

On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:


On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at
dberlin.org<mailto:dberlin at dberlin.org>> wrote:



On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini <mehdi.amini at
apple.com<mailto:mehdi.amini at apple.com>> wrote:


On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at
dberlin.org<mailto:dberlin at dberlin.org>> wrote:



On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev
<llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>>
wrote:


On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via llvm-dev
<llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>>
wrote:

Hi Johannes,

I am especially curious where you get your data from. Tapir [0]
(and to some degree PIR [1]) have shown that,
counterintuitively, only a few changes to LLVM passes are
needed. Tapir was recently used in an MIT class with a lot of
students and it seemed to work well with only minimal changes to
analysis and especially transformation passes.

TAPIR is an elegant, small extension and, in particular, I think
the idea of asymmetric parallel tasks and control flow is a
clever way to express parallelism with serial semantics, as in
Cilk.  Encoding the control flow extensions as explicit
instructions is orthogonal to that, though arguably more elegant than using
region tags + metadata.

However, Cilk is a tiny language compared with the full
complexity of other languages, like OpenMP.  To take just one
example, TAPIR cannot express the ORDERED construct of OpenMP.  A
more serious concern, IMO, is that TAPIR (like Cilk) requires
serial semantics, whereas there are many parallel languages, OpenMP included,
that do not obey that restriction.
Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE,
that are needed because without that, you’d be dependent on
fundamentally hard compiler analyses to extract the same
information for satisfactory parallel performance; realistic
applications cannot depend on the success of such analyses.

I agree with this, but I’m also wondering if it needs to be first
class in the IR?
For example we know our alias analysis is very basic, and C/C++
have a higher constraint thanks to their type system, but we
didn’t inject this higher level information that helps the
optimizer as first class IR constructs.


FWIW, while i agree with the general point, i wouldn't use this example.
Because we pretty much still suffer to this day because of it (both
in AA, and devirt, and ...)  :) We can't always even tell fields
apart


Is it inherent to the infrastructure, i.e. using metadata instead
of first class IR construct or is it just a “quality of implementation” issue?

Not to derail this conversation:

IMHO, At some point there is no real difference :)

Because otherwise, everything is a QOI issue.

IE if it's super tricky to get metadata that works well and works
right, doesn't get lost, etc, and that's inherent to using metadata,
that to me is not a QOI issue.

So could it be done with metadata? Probably?
But at the same time,  if it had been done with more first class
constructs, it would have happened years ago  and been much lower cost.


This is what I meant by “inherent to the infrastructure”, thanks for
clarifying.


To clarify, we were proposing metadata that is used as arguments to
the region-annotation intrinsics. This metadata has the nice
property that it does not get dropped (so it is just being used as a
way of encoding whatever data structures are necessary without predefining a
syntactic schema).

-Hal


—
Mehdi




_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


--
Hal Finkel
Lead, Compiler Technology and Programming Languages Leadership
Computing Facility Argonne National Laboratory


_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev




-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170201/5e64c24c/attachment-0001.html>

Hal Finkel via llvm-dev

2017-Feb-01 17:45 UTC

head link

[llvm-dev] [RFC] IR-level Region Annotations

On 01/31/2017 09:30 PM, Mehdi Amini via llvm-dev wrote:>
>
> Sent from my iPhone
>
> On Jan 31, 2017, at 7:27 PM, Tian, Xinmin <xinmin.tian at intel.com 
> <mailto:xinmin.tian at intel.com>> wrote:
>
>> Remember that,  the prepare-phase is invoked in the FE or right after 
>> FE, inlining is not happening, that is why we don't call it
"pass".
>>  Chandler made a good point for this case a long time back.
>>
>
> What I was describing is the inlining in the optimizer pipeline.
FWIW, we only do this for allocas in the entry block, and it would be 
trivial to having the code stop scanning the entry block upon hitting 
some specifically-tagged thing(s).
>
>
>
>> Hoisting alloca is totally ok.  A new alloca is generated during 
>> outlining later on for anything marked as "private" (so long
the
>> "private" information is saves in the tag). I thought we
talked this
>> in an early email.
>
>
> Can you describe how (and at which point) you get the private for 
> "var"  added to the tag?
I don't think this is necessarily the scheme that we'd use for an 
upstream implementation.

Thanks again,
Hal
>
>
> -- 
> Mehdi
>
>>
>> By the way, all concerns you have are all valid, we had worked on 
>> resolving these issues 10+ years back when we did similar things in 
>> our compilers. I wouldn't claim we have perfect solutions, but we
do
>> reasonable good solutions for handling general directives and openmp 
>> directives.
>>
>> Xinmin
>>
>> -----Original Message-----
>> From: mehdi.amini at apple.com <mailto:mehdi.amini at apple.com> 
>> [mailto:mehdi.amini at apple.com]
>> Sent: Tuesday, January 31, 2017 7:08 PM
>> To: Tian, Xinmin <xinmin.tian at intel.com <mailto:xinmin.tian at
intel.com>>
>> Cc: Sanjoy Das <sanjoy at playingwithpointers.com 
>> <mailto:sanjoy at playingwithpointers.com>>; Adve, Vikram
Sadanand
>> <vadve at illinois.edu <mailto:vadve at illinois.edu>>; 
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>; 
>> llvm-dev-request at lists.llvm.org <mailto:llvm-dev-request at
lists.llvm.org>
>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>
>>
>>> On Jan 31, 2017, at 6:48 PM, Tian, Xinmin <xinmin.tian at
intel.com
>>> <mailto:xinmin.tian at intel.com>> wrote:
>>>
>>> Let me try this.
>>>
>>> You can simply consider the prepare-phase (e.g. pre-privatization) 
>>>  were done in FE  (actually a library can be used by multiple FEs
at
>>> LLVM IR level),  the region is run with 1 thread,  region
annotation
>>> (scope, single-entry-single-exit) as memory barrier conservatively 
>>> for now (instead of checking individual memory dependency, aliasing
>>> via tags which is the actual implementation is done) marked with 
>>> region intrinsic functions. What optimization will mess up  with 
>>> this region-annotation?
>>
>> The first thing that comes to my mind is inlining that can put the IR 
>> in a form that breaks the invariant you tried to enforce with your 
>> "prepare-phase” (for example by hoisting an allocas).
>>
>> —
>> Mehdi
>>
>>
>>>
>>>
>>> -----Original Message-----
>>> From: mehdi.amini at apple.com <mailto:mehdi.amini at
apple.com>
>>> [mailto:mehdi.amini at apple.com]
>>> Sent: Tuesday, January 31, 2017 5:47 PM
>>> To: Tian, Xinmin <xinmin.tian at intel.com
<mailto:xinmin.tian at intel.com>>
>>> Cc: Sanjoy Das <sanjoy at playingwithpointers.com 
>>> <mailto:sanjoy at playingwithpointers.com>>; Adve, Vikram
Sadanand
>>> <vadve at illinois.edu <mailto:vadve at illinois.edu>>;
>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>;
>>> llvm-dev-request at lists.llvm.org <mailto:llvm-dev-request at
lists.llvm.org>
>>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>>
>>>
>>>> On Jan 31, 2017, at 5:38 PM, Tian, Xinmin <xinmin.tian at
intel.com
>>>> <mailto:xinmin.tian at intel.com>> wrote:
>>>>
>>>>>>>> Ok, but this looks like a “workaround" for
your specific
>>>>>>>> use-case, I don’t see how it can scale as a
model-agnostic and
>>>>>>>> general-purpose region semantic.
>>>>
>>>> I would say it is a design trade-off.
>>>
>>> I’m not sure if we’re talking about the same thing here: my 
>>> understanding at this point is that the design trading-off you’re 
>>> making “simplicity” by scarifying “correctness”.
>>>
>>> Requiring the IR to stay in what you’re calling a “canonical” form 
>>> in your answer to Sanjoy in order to not miscompile a program is
not
>>> an approach that seems compatible with how we deal with the IR
usually.
>>>
>>>> Regardless it is a new instruction or an intrinsics with
token/tag,
>>>> it will consist of  model-agnostic part and model-non-agnostic 
>>>> part. The package comes with a framework for parsing and using 
>>>> these intrinsics.  See the reply I had for Sanjoy's email.
>>>
>>> The answer to Sanjoy is not really helpful to clarify anything to 
>>> me. At this point I still don’t understand how this is supposed to 
>>> be correct in general.
>>>
>>> It would be helpful to have a LangRef patch that describes the 
>>> semantic associated to your region intrinsics. Then we may be able 
>>> to process some examples through the formalized description.
>>>
>>> —
>>> Mehdi
>>>
>>>
>>>>
>>>> -----Original Message-----
>>>> From: mehdi.amini at apple.com <mailto:mehdi.amini at
apple.com>
>>>> [mailto:mehdi.amini at apple.com]
>>>> Sent: Saturday, January 21, 2017 1:57 PM
>>>> To: Tian, Xinmin <xinmin.tian at intel.com
<mailto:xinmin.tian at intel.com>>
>>>> Cc: Sanjoy Das <sanjoy at playingwithpointers.com 
>>>> <mailto:sanjoy at playingwithpointers.com>>; Adve,
Vikram Sadanand
>>>> <vadve at illinois.edu <mailto:vadve at
illinois.edu>>;
>>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>;
>>>> llvm-dev-request at lists.llvm.org 
>>>> <mailto:llvm-dev-request at lists.llvm.org>
>>>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>>>
>>>>
>>>>> On Jan 20, 2017, at 11:17 AM, Tian, Xinmin <xinmin.tian
at intel.com
>>>>> <mailto:xinmin.tian at intel.com>> wrote:
>>>>>
>>>>>>>>> This means that the optimizer has to be
aware of it, I’m
>>>>>>>>> missing the magic here?
>>>>>
>>>>> This is one option.
>>>>>
>>>>> The another option is that, as I mentioned in our LLVM-HPC
paper
>>>>> in our implementation. We have a "prepare phase for 
>>>>> pre-privatization" can be invoked by both Clang FE and
Fortran FE
>>>>> right after LLVM IR is generated. So, in this way, we are
able to
>>>>> minimize the optimizations impact for the original val and
I
>>>>
>>>> Ok, but this looks like a “workaround" for your specific
use-case,
>>>> I don’t see how it can scale as a model-agnostic and 
>>>> general-purpose region semantic.
>>>>
>>>> The fact that you needed this pre-step in the first place seems
to
>>>> indicate to me that it confirms what multiple people expressed
in
>>>> this thread, for example what Daniel wrote here: 
>>>>
http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html
>>>>
>>>> —
>>>> Mehdi
>>>>
>>>>
>>>>
>>>>>
>>>>> { void main() {
>>>>> i32* val = alloca i32
>>>>> i32* I = alloca 32
>>>>> i32* priv_val = alloca i32
>>>>> i32* priv_i alloca 32
>>>>> tok = llvm.experimental.intrinsic_a()[ 
>>>>> "DIR.PARALLEL"(),"QUAL.PRIVATE"(i32*
%priv_val, i32 %priv_i),
>>>>> "QUAL.NUM_THREADS"(i32 4)]
>>>>>
>>>>> %priv_i = omp_get_thread_num();
>>>>> compute_something_into_val(%priv_val, %priv_i);
>>>>> a[priv_i] = %priv_val;
>>>>>
>>>>>
llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>>>>> ....
>>>>>
>>>>> I >>>>> Val >>>>> Foo(val,
i).
>>>>> }
>>>>>
>>>>> "Prepare phase" is our way of minimizing the
impact to existing
>>>>> optimizations.
>>>>>
>>>>> Xinmin
>>>>>
>>>>> -----Original Message-----
>>>>> From: mehdi.amini at apple.com <mailto:mehdi.amini at
apple.com>
>>>>> [mailto:mehdi.amini at apple.com]
>>>>> Sent: Friday, January 20, 2017 10:54 AM
>>>>> To: Tian, Xinmin
>>>>> Cc: Sanjoy Das; Adve, Vikram Sadanand; llvm-dev at
lists.llvm.org
>>>>> <mailto:llvm-dev at lists.llvm.org>; llvm-dev-request
at lists.llvm.org
>>>>> <mailto:llvm-dev-request at lists.llvm.org>
>>>>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>>>>
>>>>>
>>>>>> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev
>>>>>> <llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>> wrote:
>>>>>>
>>>>>> Sanjoy, the IR would be like something below. It is ok
to hoist
>>>>>> alloca instruction outside the region. There are some
small
>>>>>> changes in optimizer to understand region-annotation
intrinsic.
>>>>>>
>>>>>> { void main() {
>>>>>> i32* val = alloca i32
>>>>>> tok = llvm.experimental.intrinsic_a()[
"DIR.PARALLEL"(),
>>>>>> "QUAL.PRIVATE"(i32* val),
"QUAL.NUM_THREADS"(i32 4)]
>>>>>>
>>>>>> int i = omp_get_thread_num();
>>>>>> compute_something_into_val(val, i);
>>>>>> a[i] = val;
>>>>>>
>>>>>>
llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>>>>>> }
>>>>>>
>>>>>> With above representation, we can do privatization and
outlining as
>>>>>> below
>>>>>>
>>>>>> { void main() {
>>>>>> i32* val = alloca i32
>>>>>> i32* I = alloca 32
>>>>>> tok = llvm.experimental.intrinsic_a()[
"DIR.PARALLEL"(),
>>>>>> "QUAL.PRIVATE"(i32* %val, i32 %i),
"QUAL.NUM_THREADS"(i32 4)]
>>>>>>
>>>>>> %ii = omp_get_thread_num();
>>>>>> compute_something_into_val(%val, %i);  a[i] = %val;
>>>>>>
>>>>>>
llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>>>>>> }
>>>>>>
>>>>>
>>>>> Here we come to the interesting part: the hoisting of 
"i32* I =
>>>>> alloca 32” above the intrinsics required to update the
intrinsics
>>>>> information “QUAL.PRIVATE”.
>>>>> This means that the optimizer has to be aware of it, I’m
missing
>>>>> the magic here?
>>>>> I understand that an openmp specific optimization can do
it, the
>>>>> question is how it an openmp agnostic supposed to behave in
face
>>>>> of llvm.experimental.intrinsic_a?
>>>>>
>>>>> —
>>>>> Mehdi
>>>>>
>>>>>
>>>>>
>>>>>> 1. create i32* priv_val = alloca i32  %priv_i = ...in
the region,
>>>>>> and replace all  %val with %prv_val in the region.
>>>>>> 2. perform outlining.
>>>>>>
>>>>>> Caller code
>>>>>> ....
>>>>>> omp_push_num_threads(4)
>>>>>> omp_fork_call( .... outline_par_region....) ....
>>>>>>
>>>>>> Callee code:
>>>>>> Outlined_par_rgion {
>>>>>> I32* priv_val = alloca 32
>>>>>> I32* priv_i = ....
>>>>>>
>>>>>> Ret
>>>>>> }
>>>>>>
>>>>>> For OpenMP, we do support it at -O0, -O1, -O2 and -O3. 
We had to
>>>>>> make sure it runs correctly w/ and w/o optimizations
and advanced
>>>>>> analysis. So we need to preserve all source information
for BE.
>>>>>> You can take a look our LLVM-HPC paper for a bit some
details.
>>>>>>  There are still tons of work to be done. Thanks.
>>>>>>
>>>>>> Xinmin
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: llvm-dev [mailto:llvm-dev-bounces at
lists.llvm.org] On Behalf
>>>>>> Of Sanjoy Das via llvm-dev
>>>>>> Sent: Thursday, January 19, 2017 10:13 PM
>>>>>> To: Adve, Vikram Sadanand <vadve at illinois.edu 
>>>>>> <mailto:vadve at illinois.edu>>
>>>>>> Cc: llvm-dev <llvm-dev at lists.llvm.org 
>>>>>> <mailto:llvm-dev at lists.llvm.org>>; 
>>>>>> llvm-dev-request at lists.llvm.org 
>>>>>> <mailto:llvm-dev-request at lists.llvm.org>
>>>>>> Subject: Re: [llvm-dev] [RFC] IR-level Region
Annotations
>>>>>>
>>>>>> Hi Vikram,
>>>>>>
>>>>>> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand 
>>>>>> <vadve at illinois.edu <mailto:vadve at
illinois.edu>> wrote:
>>>>>>> Hi Sanjoy,
>>>>>>>
>>>>>>> Yes, that's exactly what we have been looking
at recently here, but
>>>>>>> the region tags seem to make it possible to express
the control
>>>>>>> flow
>>>>>>> as well, so I think we could start with reg
ions+metadata, as
>>>>>>> Hal and
>>>>>>
>>>>>> I'm not yet convinced that region tags are
sufficient to model
>>>>>> exotic control flow.
>>>>>>
>>>>>> (I don't know OpenMP so this is a
copy-pasted-edited example)
>>>>>>
>>>>>> Say we have:
>>>>>>
>>>>>> void main() {
>>>>>> #pragma omp parallel num_threads(4)
>>>>>> {
>>>>>> int i = omp_get_thread_num();
>>>>>> int val;
>>>>>> compute_something_into_val(&val, i);
>>>>>> a[i] = val;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> I presume the (eventual) intended lowering is something
like this
>>>>>> (if the intended lowering is different than this, and
avoids the
>>>>>> issue I'm trying to highlight then my point is
moot):
>>>>>>
>>>>>> void main() {
>>>>>> tok = llvm.experimental.intrinsic_a();
>>>>>>
>>>>>> int i = omp_get_thread_num();
>>>>>> i32* val = alloca i32
>>>>>> compute_something_into_val(val, i);
>>>>>> a[i] = val;
>>>>>>
>>>>>> llvm.experimental.intrinsic_b(tok);
>>>>>> }
>>>>>>
>>>>>> However, LLVM is free to hoist the alloca to the entry
block:
>>>>>>
>>>>>> void main() {
>>>>>> i32* val = alloca i32
>>>>>> tok = llvm.experimental.intrinsic_a();
>>>>>>
>>>>>> int i = omp_get_thread_num();
>>>>>> compute_something_into_val(val, i);
>>>>>> a[i] = val;
>>>>>>
>>>>>> llvm.experimental.intrinsic_b(tok);
>>>>>> }
>>>>>>
>>>>>> and now you have a race between the four parallel
forks.
>>>>>>
>>>>>> The problem here is that nothing in the IR expresses
that we have
>>>>>> four copies of the region running "at the same
time".  In fact,
>>>>>> such a control flow is alien to LLVM today.
>>>>>>
>>>>>> For instance, another evil optimization may turn:
>>>>>>
>>>>>> void main() {
>>>>>> int a[4];
>>>>>> #pragma omp parallel num_threads(4)
>>>>>> {
>>>>>> int i = omp_get_thread_num();
>>>>>> int val = compute_something_into_val(i);
>>>>>> a[i] = val;
>>>>>> }
>>>>>>
>>>>>> return a[0] + a[1];
>>>>>> }
>>>>>>
>>>>>> to
>>>>>>
>>>>>> void main() {
>>>>>> int a[4];
>>>>>> #pragma omp parallel num_threads(4)
>>>>>> {
>>>>>> int i = omp_get_thread_num();
>>>>>> int val = compute_something_into_val(i);
>>>>>> a[i] = val;
>>>>>> }
>>>>>>
>>>>>> return undef;
>>>>>> }
>>>>>>
>>>>>> since a[i] = val could have initialized at most one
element in a.
>>>>>>
>>>>>> Now you could say that the
llvm.experimental.intrinsic_a and
>>>>>> llvm.experimental.intrinsic_b intrinsics are magic, and
even such
>>>>>> "obvious" optimizations are not allowed to
happen across them;
>>>>>> but then calls to these intrinsics is pretty
fundamentally
>>>>>> different from "normal" calls, and are
probably best modeled as
>>>>>> new instructions.
>>>>>> You're going to have to do the same kind of
auditing of passes
>>>>>> either way, and the only extra cost of a new
instruction is the
>>>>>> extra bitcode reading / writing code.
>>>>>>
>>>>>> I hope I made sense.
>>>>>>
>>>>>> -- Sanjoy
>>>>>>
>>>>>>> Xinmin proposed, and then figure out what needs to
be first class
>>>>>>> instructions.
>>>>>>
>>>>>>>
>>>>>>> --Vikram Adve
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das 
>>>>>>>> <sanjoy at playingwithpointers.com 
>>>>>>>> <mailto:sanjoy at
playingwithpointers.com>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> My bias is to use both (b) and (d), since they
have complementary
>>>>>>>> strengths.  We should use (b) for expressing
concepts that
>>>>>>>> can't be
>>>>>>>> semantically modeled as a call or invoke (this
branch takes
>>>>>>>> both its
>>>>>>>> successors), and (d) for expressing things that
can be (this
>>>>>>>> call may
>>>>>>>> never return), and annotation like things (this
region (denoted by
>>>>>>>> def-use of a token) is a reduction).
>>>>>>>>
>>>>>>>> I don't grok OpenMP, but perhaps we can
come with one or two
>>>>>>>> "generalized control flow"-type
instructions that can be used to
>>>>>>>> model the non-call/invoke like semantics
we'd like LLVM to know
>>>>>>>> about, and model the rest with (d)?
>>>>>>>>
>>>>>>>> -- Sanjoy
>>>>>>>>
>>>>>>>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via
llvm-dev
>>>>>>>> <llvm-dev at lists.llvm.org
<mailto:llvm-dev at lists.llvm.org>> wrote:
>>>>>>>>>
>>>>>>>>> On 01/19/2017 03:36 PM, Mehdi Amini via
llvm-dev wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin 
>>>>>>>>> <dberlin at dberlin.org
<mailto:dberlin at dberlin.org>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi
Amini
>>>>>>>>>> <mehdi.amini at apple.com
<mailto:mehdi.amini at apple.com>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Jan 19, 2017, at 12:04 PM, Daniel
Berlin
>>>>>>>>>> <dberlin at dberlin.org
<mailto:dberlin at dberlin.org>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi
Amini via llvm-dev
>>>>>>>>>> <llvm-dev at lists.llvm.org
<mailto:llvm-dev at lists.llvm.org>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On Jan 19, 2017, at 11:36 AM,
Adve, Vikram Sadanand via
>>>>>>>>>>>> llvm-dev
>>>>>>>>>>>> <llvm-dev at lists.llvm.org
<mailto:llvm-dev at lists.llvm.org>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Johannes,
>>>>>>>>>>>>
>>>>>>>>>>>>> I am especially curious
where you get your data from.
>>>>>>>>>>>>> Tapir [0]
>>>>>>>>>>>>> (and to some degree PIR
[1]) have shown that,
>>>>>>>>>>>>> counterintuitively, only a
few changes to LLVM passes are
>>>>>>>>>>>>> needed. Tapir was recently
used in an MIT class with a lot of
>>>>>>>>>>>>> students and it seemed to
work well with only minimal
>>>>>>>>>>>>> changes to
>>>>>>>>>>>>> analysis and especially
transformation passes.
>>>>>>>>>>>>
>>>>>>>>>>>> TAPIR is an elegant, small
extension and, in particular, I
>>>>>>>>>>>> think
>>>>>>>>>>>> the idea of asymmetric parallel
tasks and control flow is a
>>>>>>>>>>>> clever way to express
parallelism with serial semantics, as in
>>>>>>>>>>>> Cilk.  Encoding the control
flow extensions as explicit
>>>>>>>>>>>> instructions is orthogonal to
that, though arguably more
>>>>>>>>>>>> elegant than using region tags
+ metadata.
>>>>>>>>>>>>
>>>>>>>>>>>> However, Cilk is a tiny
language compared with the full
>>>>>>>>>>>> complexity of other languages,
like OpenMP.  To take just one
>>>>>>>>>>>> example, TAPIR cannot express
the ORDERED construct of
>>>>>>>>>>>> OpenMP.  A
>>>>>>>>>>>> more serious concern, IMO, is
that TAPIR (like Cilk) requires
>>>>>>>>>>>> serial semantics, whereas there
are many parallel
>>>>>>>>>>>> languages, OpenMP included,
that do not obey that restriction.
>>>>>>>>>>>> Third, OpenMP has *numerous*
clauses, e.g., REDUCTION or
>>>>>>>>>>>> PRIVATE,
>>>>>>>>>>>> that are needed because without
that, you’d be dependent on
>>>>>>>>>>>> fundamentally hard compiler
analyses to extract the same
>>>>>>>>>>>> information for satisfactory
parallel performance; realistic
>>>>>>>>>>>> applications cannot depend on
the success of such analyses.
>>>>>>>>>>>
>>>>>>>>>>> I agree with this, but I’m also
wondering if it needs to be
>>>>>>>>>>> first
>>>>>>>>>>> class in the IR?
>>>>>>>>>>> For example we know our alias
analysis is very basic, and C/C++
>>>>>>>>>>> have a higher constraint thanks to
their type system, but we
>>>>>>>>>>> didn’t inject this higher level
information that helps the
>>>>>>>>>>> optimizer as first class IR
constructs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> FWIW, while i agree with the general
point, i wouldn't use
>>>>>>>>>> this example.
>>>>>>>>>> Because we pretty much still suffer to
this day because of it
>>>>>>>>>> (both
>>>>>>>>>> in AA, and devirt, and ...)  :) We
can't always even tell fields
>>>>>>>>>> apart
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Is it inherent to the infrastructure,
i.e. using metadata
>>>>>>>>>> instead
>>>>>>>>>> of first class IR construct or is it
just a “quality of
>>>>>>>>>> implementation” issue?
>>>>>>>>>
>>>>>>>>> Not to derail this conversation:
>>>>>>>>>
>>>>>>>>> IMHO, At some point there is no real
difference :)
>>>>>>>>>
>>>>>>>>> Because otherwise, everything is a QOI
issue.
>>>>>>>>>
>>>>>>>>> IE if it's super tricky to get metadata
that works well and works
>>>>>>>>> right, doesn't get lost, etc, and
that's inherent to using
>>>>>>>>> metadata,
>>>>>>>>> that to me is not a QOI issue.
>>>>>>>>>
>>>>>>>>> So could it be done with metadata?
Probably?
>>>>>>>>> But at the same time,  if it had been done
with more first class
>>>>>>>>> constructs, it would have happened years
ago  and been much
>>>>>>>>> lower cost.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is what I meant by “inherent to the
infrastructure”,
>>>>>>>>> thanks for
>>>>>>>>> clarifying.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> To clarify, we were proposing metadata that
is used as
>>>>>>>>> arguments to
>>>>>>>>> the region-annotation intrinsics. This
metadata has the nice
>>>>>>>>> property that it does not get dropped (so
it is just being
>>>>>>>>> used as a
>>>>>>>>> way of encoding whatever data structures
are necessary without
>>>>>>>>> predefining a syntactic schema).
>>>>>>>>>
>>>>>>>>> -Hal
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> —
>>>>>>>>> Mehdi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
_______________________________________________
>>>>>>>>> LLVM Developers mailing list
>>>>>>>>> llvm-dev at lists.llvm.org
<mailto:llvm-dev at lists.llvm.org>
>>>>>>>>>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Hal Finkel
>>>>>>>>> Lead, Compiler Technology and Programming
Languages Leadership
>>>>>>>>> Computing Facility Argonne National
Laboratory
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
_______________________________________________
>>>>>>>>> LLVM Developers mailing list
>>>>>>>>> llvm-dev at lists.llvm.org
<mailto:llvm-dev at lists.llvm.org>
>>>>>>>>>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>>
>>>
>>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170201/a1235674/attachment-0001.html>

Reasonably Related Threads

Search for more maybe matching threads

llvm dev - Feb 2017 - [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

Reasonably Related Threads