thr3ads.net - llvm dev - [llvm-dev] [RFC] IR-level Region Annotations [Jan 2017]

If this information is useful, please help other people find it:
Share via:

Adve, Vikram Sadanand via llvm-dev

2017-Jan-20 05:27 UTC

[llvm-dev] [RFC] IR-level Region Annotations

Hi Sanjoy,

Yes, that's exactly what we have been looking at recently here, but the
region tags seem to make it possible to express the control flow as well, so I
think we could start with reg ions+metadata, as Hal and Xinmin proposed, and
then figure out what needs to be first class instructions.

--Vikram Adve


> On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy at
playingwithpointers.com> wrote:
> 
> Hi,
> 
> My bias is to use both (b) and (d), since they have complementary
> strengths.  We should use (b) for expressing concepts that can't be
> semantically modeled as a call or invoke (this branch takes both its
> successors), and (d) for expressing things that can be (this call may
> never return), and annotation like things (this region (denoted by
> def-use of a token) is a reduction).
> 
> I don't grok OpenMP, but perhaps we can come with one or two
> "generalized control flow"-type instructions that can be used to
model
> the non-call/invoke like semantics we'd like LLVM to know about, and
> model the rest with (d)?
> 
> -- Sanjoy
> 
> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> 
>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>> 
>> 
>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
>> 
>> 
>> 
>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini <mehdi.amini at
apple.com> wrote:
>>> 
>>> 
>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
>>> 
>>> 
>>> 
>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>> 
>>>> 
>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via
llvm-dev
>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>> 
>>>>> Hi Johannes,
>>>>> 
>>>>>> I am especially curious where you get your data from.
Tapir [0] (and
>>>>>> to
>>>>>> some degree PIR [1]) have shown that,
counterintuitively, only a few
>>>>>> changes
>>>>>> to LLVM passes are needed. Tapir was recently used in
an MIT class
>>>>>> with a
>>>>>> lot of students and it seemed to work well with only
minimal changes
>>>>>> to analysis and especially transformation passes.
>>>>> 
>>>>> TAPIR is an elegant, small extension and, in particular, I
think the
>>>>> idea of asymmetric parallel tasks and control flow is a
clever way to
>>>>> express parallelism with serial semantics, as in Cilk. 
Encoding the control
>>>>> flow extensions as explicit instructions is orthogonal to
that, though
>>>>> arguably more elegant than using region tags + metadata.
>>>>> 
>>>>> However, Cilk is a tiny language compared with the full
complexity of
>>>>> other languages, like OpenMP.  To take just one example,
TAPIR cannot
>>>>> express the ORDERED construct of OpenMP.  A more serious
concern, IMO, is
>>>>> that TAPIR (like Cilk) requires serial semantics, whereas
there are many
>>>>> parallel languages, OpenMP included, that do not obey that
restriction.
>>>>> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or
PRIVATE,  that are
>>>>> needed because without that, you’d be dependent on
fundamentally hard
>>>>> compiler analyses to extract the same information for
satisfactory parallel
>>>>> performance; realistic applications cannot depend on the
success of such
>>>>> analyses.
>>>> 
>>>> I agree with this, but I’m also wondering if it needs to be
first class
>>>> in the IR?
>>>> For example we know our alias analysis is very basic, and C/C++
have a
>>>> higher constraint thanks to their type system, but we didn’t
inject this
>>>> higher level information that helps the optimizer as first
class IR
>>>> constructs.
>>> 
>>> 
>>> FWIW, while i agree with the general point, i wouldn't use this
example.
>>> Because we pretty much still suffer to this day because of it (both
in AA,
>>> and devirt, and ...)  :)
>>> We can't always even tell fields apart
>>> 
>>> 
>>> Is it inherent to the infrastructure, i.e. using metadata instead
of first
>>> class IR construct or is it just a “quality of implementation”
issue?
>> 
>> Not to derail this conversation:
>> 
>> IMHO, At some point there is no real difference :)
>> 
>> Because otherwise, everything is a QOI issue.
>> 
>> IE if it's super tricky to get metadata that works well and works
right,
>> doesn't get lost, etc, and that's inherent to using metadata,
that to me is
>> not a QOI issue.
>> 
>> So could it be done with metadata? Probably?
>> But at the same time,  if it had been done with more first class
constructs,
>> it would have happened years ago  and been much lower cost.
>> 
>> 
>> This is what I meant by “inherent to the infrastructure”, thanks for
>> clarifying.
>> 
>> 
>> To clarify, we were proposing metadata that is used as arguments to the
>> region-annotation intrinsics. This metadata has the nice property that
it
>> does not get dropped (so it is just being used as a way of encoding
whatever
>> data structures are necessary without predefining a syntactic schema).
>> 
>> -Hal
>> 
>> 
>> —
>> Mehdi
>> 
>> 
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> 
>> 
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>

Sanjoy Das via llvm-dev

2017-Jan-20 06:12 UTC

head link

[llvm-dev] [RFC] IR-level Region Annotations

Hi Vikram,

On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand
<vadve at illinois.edu> wrote:> Hi Sanjoy,
>
> Yes, that's exactly what we have been looking at recently here, but
> the region tags seem to make it possible to express the control flow
> as well, so I think we could start with reg ions+metadata, as Hal and
I'm not yet convinced that region tags are sufficient to model exotic
control flow.

(I don't know OpenMP so this is a copy-pasted-edited example)

Say we have:

void main() {
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val;
    compute_something_into_val(&val, i);
    a[i] = val;
  }
}

I presume the (eventual) intended lowering is something like this (if
the intended lowering is different than this, and avoids the issue I'm
trying to highlight then my point is moot):

void main() {
  tok = llvm.experimental.intrinsic_a();

  int i = omp_get_thread_num();
  i32* val = alloca i32
  compute_something_into_val(val, i);
  a[i] = val;

  llvm.experimental.intrinsic_b(tok);
}

However, LLVM is free to hoist the alloca to the entry block:

void main() {
  i32* val = alloca i32
  tok = llvm.experimental.intrinsic_a();

  int i = omp_get_thread_num();
  compute_something_into_val(val, i);
  a[i] = val;

  llvm.experimental.intrinsic_b(tok);
}

and now you have a race between the four parallel forks.

The problem here is that nothing in the IR expresses that we have four
copies of the region running "at the same time".  In fact, such a
control flow is alien to LLVM today.

For instance, another evil optimization may turn:

void main() {
  int a[4];
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val = compute_something_into_val(i);
    a[i] = val;
  }

  return a[0] + a[1];
}

to

void main() {
  int a[4];
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val = compute_something_into_val(i);
    a[i] = val;
  }

  return undef;
}

since a[i] = val could have initialized at most one element in a.

Now you could say that the llvm.experimental.intrinsic_a and
llvm.experimental.intrinsic_b intrinsics are magic, and even such
"obvious" optimizations are not allowed to happen across them; but
then
calls to these intrinsics is pretty fundamentally different from
"normal" calls, and are probably best modeled as new instructions.
You're going to have to do the same kind of auditing of passes either
way, and the only extra cost of a new instruction is the extra bitcode
reading / writing code.

I hope I made sense.

-- Sanjoy
> Xinmin proposed, and then figure out what needs to be first class
> instructions.
>
> --Vikram Adve
>
>
>
>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy at
playingwithpointers.com> wrote:
>>
>> Hi,
>>
>> My bias is to use both (b) and (d), since they have complementary
>> strengths.  We should use (b) for expressing concepts that can't be
>> semantically modeled as a call or invoke (this branch takes both its
>> successors), and (d) for expressing things that can be (this call may
>> never return), and annotation like things (this region (denoted by
>> def-use of a token) is a reduction).
>>
>> I don't grok OpenMP, but perhaps we can come with one or two
>> "generalized control flow"-type instructions that can be used
to model
>> the non-call/invoke like semantics we'd like LLVM to know about,
and
>> model the rest with (d)?
>>
>> -- Sanjoy
>>
>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>>>
>>>
>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
>>>
>>>
>>>
>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini <mehdi.amini at
apple.com> wrote:
>>>>
>>>>
>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
>>>>
>>>>
>>>>
>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev
>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>>
>>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via
llvm-dev
>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>
>>>>>> Hi Johannes,
>>>>>>
>>>>>>> I am especially curious where you get your data
from. Tapir [0] (and
>>>>>>> to
>>>>>>> some degree PIR [1]) have shown that,
counterintuitively, only a few
>>>>>>> changes
>>>>>>> to LLVM passes are needed. Tapir was recently used
in an MIT class
>>>>>>> with a
>>>>>>> lot of students and it seemed to work well with
only minimal changes
>>>>>>> to analysis and especially transformation passes.
>>>>>>
>>>>>> TAPIR is an elegant, small extension and, in
particular, I think the
>>>>>> idea of asymmetric parallel tasks and control flow is a
clever way to
>>>>>> express parallelism with serial semantics, as in Cilk. 
Encoding the control
>>>>>> flow extensions as explicit instructions is orthogonal
to that, though
>>>>>> arguably more elegant than using region tags +
metadata.
>>>>>>
>>>>>> However, Cilk is a tiny language compared with the full
complexity of
>>>>>> other languages, like OpenMP.  To take just one
example, TAPIR cannot
>>>>>> express the ORDERED construct of OpenMP.  A more
serious concern, IMO, is
>>>>>> that TAPIR (like Cilk) requires serial semantics,
whereas there are many
>>>>>> parallel languages, OpenMP included, that do not obey
that restriction.
>>>>>> Third, OpenMP has *numerous* clauses, e.g., REDUCTION
or PRIVATE,  that are
>>>>>> needed because without that, you’d be dependent on
fundamentally hard
>>>>>> compiler analyses to extract the same information for
satisfactory parallel
>>>>>> performance; realistic applications cannot depend on
the success of such
>>>>>> analyses.
>>>>>
>>>>> I agree with this, but I’m also wondering if it needs to be
first class
>>>>> in the IR?
>>>>> For example we know our alias analysis is very basic, and
C/C++ have a
>>>>> higher constraint thanks to their type system, but we
didn’t inject this
>>>>> higher level information that helps the optimizer as first
class IR
>>>>> constructs.
>>>>
>>>>
>>>> FWIW, while i agree with the general point, i wouldn't use
this example.
>>>> Because we pretty much still suffer to this day because of it
(both in AA,
>>>> and devirt, and ...)  :)
>>>> We can't always even tell fields apart
>>>>
>>>>
>>>> Is it inherent to the infrastructure, i.e. using metadata
instead of first
>>>> class IR construct or is it just a “quality of implementation”
issue?
>>>
>>> Not to derail this conversation:
>>>
>>> IMHO, At some point there is no real difference :)
>>>
>>> Because otherwise, everything is a QOI issue.
>>>
>>> IE if it's super tricky to get metadata that works well and
works right,
>>> doesn't get lost, etc, and that's inherent to using
metadata, that to me is
>>> not a QOI issue.
>>>
>>> So could it be done with metadata? Probably?
>>> But at the same time,  if it had been done with more first class
constructs,
>>> it would have happened years ago  and been much lower cost.
>>>
>>>
>>> This is what I meant by “inherent to the infrastructure”, thanks
for
>>> clarifying.
>>>
>>>
>>> To clarify, we were proposing metadata that is used as arguments to
the
>>> region-annotation intrinsics. This metadata has the nice property
that it
>>> does not get dropped (so it is just being used as a way of encoding
whatever
>>> data structures are necessary without predefining a syntactic
schema).
>>>
>>> -Hal
>>>
>>>
>>> —
>>> Mehdi
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>

Johannes Doerfert via llvm-dev

2017-Jan-20 13:01 UTC

head link

[llvm-dev] [RFC] IR-level Region Annotations

Hi Sanjoy,

thanks for the example! It is petty similar to what we have in our PIR
white paper (see page 10 in [0]). We think such cases nicely showcase
that we should use actual control flow (in combination with SSA and
dominance) to prevent current optimizations to break parallel invariants
even without them knowing about parallelism.

Cheers,
  Johannes

[0] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf

On 01/19, Sanjoy Das via llvm-dev wrote:> Hi Vikram,
> 
> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand
> <vadve at illinois.edu> wrote:
> > Hi Sanjoy,
> >
> > Yes, that's exactly what we have been looking at recently here,
but
> > the region tags seem to make it possible to express the control flow
> > as well, so I think we could start with reg ions+metadata, as Hal and
> 
> I'm not yet convinced that region tags are sufficient to model exotic
> control flow.
> 
> (I don't know OpenMP so this is a copy-pasted-edited example)
> 
> Say we have:
> 
> void main() {
>   #pragma omp parallel num_threads(4)
>   {
>     int i = omp_get_thread_num();
>     int val;
>     compute_something_into_val(&val, i);
>     a[i] = val;
>   }
> }
> 
> I presume the (eventual) intended lowering is something like this (if
> the intended lowering is different than this, and avoids the issue I'm
> trying to highlight then my point is moot):
> 
> void main() {
>   tok = llvm.experimental.intrinsic_a();
> 
>   int i = omp_get_thread_num();
>   i32* val = alloca i32
>   compute_something_into_val(val, i);
>   a[i] = val;
> 
>   llvm.experimental.intrinsic_b(tok);
> }
> 
> However, LLVM is free to hoist the alloca to the entry block:
> 
> void main() {
>   i32* val = alloca i32
>   tok = llvm.experimental.intrinsic_a();
> 
>   int i = omp_get_thread_num();
>   compute_something_into_val(val, i);
>   a[i] = val;
> 
>   llvm.experimental.intrinsic_b(tok);
> }
> 
> and now you have a race between the four parallel forks.
> 
> The problem here is that nothing in the IR expresses that we have four
> copies of the region running "at the same time".  In fact, such a
> control flow is alien to LLVM today.
> 
> For instance, another evil optimization may turn:
> 
> void main() {
>   int a[4];
>   #pragma omp parallel num_threads(4)
>   {
>     int i = omp_get_thread_num();
>     int val = compute_something_into_val(i);
>     a[i] = val;
>   }
> 
>   return a[0] + a[1];
> }
> 
> to
> 
> void main() {
>   int a[4];
>   #pragma omp parallel num_threads(4)
>   {
>     int i = omp_get_thread_num();
>     int val = compute_something_into_val(i);
>     a[i] = val;
>   }
> 
>   return undef;
> }
> 
> since a[i] = val could have initialized at most one element in a.
> 
> Now you could say that the llvm.experimental.intrinsic_a and
> llvm.experimental.intrinsic_b intrinsics are magic, and even such
> "obvious" optimizations are not allowed to happen across them;
but then
> calls to these intrinsics is pretty fundamentally different from
> "normal" calls, and are probably best modeled as new
instructions.
> You're going to have to do the same kind of auditing of passes either
> way, and the only extra cost of a new instruction is the extra bitcode
> reading / writing code.
> 
> I hope I made sense.
> 
> -- Sanjoy
> 
> > Xinmin proposed, and then figure out what needs to be first class
> > instructions.
> 
> >
> > --Vikram Adve
> >
> >
> >
> >> On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy at
playingwithpointers.com> wrote:
> >>
> >> Hi,
> >>
> >> My bias is to use both (b) and (d), since they have complementary
> >> strengths.  We should use (b) for expressing concepts that
can't be
> >> semantically modeled as a call or invoke (this branch takes both
its
> >> successors), and (d) for expressing things that can be (this call
may
> >> never return), and annotation like things (this region (denoted by
> >> def-use of a token) is a reduction).
> >>
> >> I don't grok OpenMP, but perhaps we can come with one or two
> >> "generalized control flow"-type instructions that can be
used to model
> >> the non-call/invoke like semantics we'd like LLVM to know
about, and
> >> model the rest with (d)?
> >>
> >> -- Sanjoy
> >>
> >> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev
> >> <llvm-dev at lists.llvm.org> wrote:
> >>>
> >>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
> >>>
> >>>
> >>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
> >>>
> >>>
> >>>
> >>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini
<mehdi.amini at apple.com> wrote:
> >>>>
> >>>>
> >>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev
> >>>> <llvm-dev at lists.llvm.org> wrote:
> >>>>>
> >>>>>
> >>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram
Sadanand via llvm-dev
> >>>>>> <llvm-dev at lists.llvm.org> wrote:
> >>>>>>
> >>>>>> Hi Johannes,
> >>>>>>
> >>>>>>> I am especially curious where you get your
data from. Tapir [0] (and
> >>>>>>> to
> >>>>>>> some degree PIR [1]) have shown that,
counterintuitively, only a few
> >>>>>>> changes
> >>>>>>> to LLVM passes are needed. Tapir was recently
used in an MIT class
> >>>>>>> with a
> >>>>>>> lot of students and it seemed to work well
with only minimal changes
> >>>>>>> to analysis and especially transformation
passes.
> >>>>>>
> >>>>>> TAPIR is an elegant, small extension and, in
particular, I think the
> >>>>>> idea of asymmetric parallel tasks and control flow
is a clever way to
> >>>>>> express parallelism with serial semantics, as in
Cilk.  Encoding the control
> >>>>>> flow extensions as explicit instructions is
orthogonal to that, though
> >>>>>> arguably more elegant than using region tags +
metadata.
> >>>>>>
> >>>>>> However, Cilk is a tiny language compared with the
full complexity of
> >>>>>> other languages, like OpenMP.  To take just one
example, TAPIR cannot
> >>>>>> express the ORDERED construct of OpenMP.  A more
serious concern, IMO, is
> >>>>>> that TAPIR (like Cilk) requires serial semantics,
whereas there are many
> >>>>>> parallel languages, OpenMP included, that do not
obey that restriction.
> >>>>>> Third, OpenMP has *numerous* clauses, e.g.,
REDUCTION or PRIVATE,  that are
> >>>>>> needed because without that, you’d be dependent on
fundamentally hard
> >>>>>> compiler analyses to extract the same information
for satisfactory parallel
> >>>>>> performance; realistic applications cannot depend
on the success of such
> >>>>>> analyses.
> >>>>>
> >>>>> I agree with this, but I’m also wondering if it needs
to be first class
> >>>>> in the IR?
> >>>>> For example we know our alias analysis is very basic,
and C/C++ have a
> >>>>> higher constraint thanks to their type system, but we
didn’t inject this
> >>>>> higher level information that helps the optimizer as
first class IR
> >>>>> constructs.
> >>>>
> >>>>
> >>>> FWIW, while i agree with the general point, i wouldn't
use this example.
> >>>> Because we pretty much still suffer to this day because of
it (both in AA,
> >>>> and devirt, and ...)  :)
> >>>> We can't always even tell fields apart
> >>>>
> >>>>
> >>>> Is it inherent to the infrastructure, i.e. using metadata
instead of first
> >>>> class IR construct or is it just a “quality of
implementation” issue?
> >>>
> >>> Not to derail this conversation:
> >>>
> >>> IMHO, At some point there is no real difference :)
> >>>
> >>> Because otherwise, everything is a QOI issue.
> >>>
> >>> IE if it's super tricky to get metadata that works well
and works right,
> >>> doesn't get lost, etc, and that's inherent to using
metadata, that to me is
> >>> not a QOI issue.
> >>>
> >>> So could it be done with metadata? Probably?
> >>> But at the same time,  if it had been done with more first
class constructs,
> >>> it would have happened years ago  and been much lower cost.
> >>>
> >>>
> >>> This is what I meant by “inherent to the infrastructure”,
thanks for
> >>> clarifying.
> >>>
> >>>
> >>> To clarify, we were proposing metadata that is used as
arguments to the
> >>> region-annotation intrinsics. This metadata has the nice
property that it
> >>> does not get dropped (so it is just being used as a way of
encoding whatever
> >>> data structures are necessary without predefining a syntactic
schema).
> >>>
> >>> -Hal
> >>>
> >>>
> >>> —
> >>> Mehdi
> >>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> LLVM Developers mailing list
> >>> llvm-dev at lists.llvm.org
> >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >>>
> >>>
> >>> --
> >>> Hal Finkel
> >>> Lead, Compiler Technology and Programming Languages
> >>> Leadership Computing Facility
> >>> Argonne National Laboratory
> >>>
> >>>
> >>> _______________________________________________
> >>> LLVM Developers mailing list
> >>> llvm-dev at lists.llvm.org
> >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >>>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-- 

Johannes Doerfert
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland Informatics Campus, Germany
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 228 bytes
Desc: Digital signature
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170120/ba773348/attachment.sig>

Tian, Xinmin via llvm-dev

2017-Jan-20 18:44 UTC

head link

[llvm-dev] [RFC] IR-level Region Annotations

Sanjoy, the IR would be like something below. It is ok to hoist alloca
instruction outside the region. There are some small changes in optimizer to
understand region-annotation intrinsic.

{ void main() {
  i32* val = alloca i32
  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]

  int i = omp_get_thread_num();
  compute_something_into_val(val, i);
  a[i] = val;

  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
}

With above representation, we can do privatization and outlining as below

{ void main() {
  i32* val = alloca i32
  i32* I = alloca 32
  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32
4)]

  %ii = omp_get_thread_num();
  compute_something_into_val(%val, %i);
  a[i] = %val;

  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
}

1. create i32* priv_val = alloca i32  %priv_i = ...in the region, and replace
all  %val with %prv_val in the region.
2. perform outlining. 

Caller code
....
omp_push_num_threads(4)
omp_fork_call( .... outline_par_region....) ....

Callee code: 
Outlined_par_rgion {
   I32* priv_val = alloca 32
   I32* priv_i = ....

    Ret
}

For OpenMP, we do support it at -O0, -O1, -O2 and -O3.  We had to make sure it
runs correctly w/ and w/o optimizations and advanced analysis. So we need to
preserve all source information for BE.
You can take a look our LLVM-HPC paper for a bit some details.  There are still
tons of work to be done. Thanks.

Xinmin 

-----Original Message-----
From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Sanjoy
Das via llvm-dev
Sent: Thursday, January 19, 2017 10:13 PM
To: Adve, Vikram Sadanand <vadve at illinois.edu>
Cc: llvm-dev <llvm-dev at lists.llvm.org>; llvm-dev-request at
lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

Hi Vikram,

On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand <vadve at
illinois.edu> wrote:> Hi Sanjoy,
>
> Yes, that's exactly what we have been looking at recently here, but 
> the region tags seem to make it possible to express the control flow 
> as well, so I think we could start with reg ions+metadata, as Hal and
I'm not yet convinced that region tags are sufficient to model exotic
control flow.

(I don't know OpenMP so this is a copy-pasted-edited example)

Say we have:

void main() {
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val;
    compute_something_into_val(&val, i);
    a[i] = val;
  }
}

I presume the (eventual) intended lowering is something like this (if the
intended lowering is different than this, and avoids the issue I'm trying to
highlight then my point is moot):

void main() {
  tok = llvm.experimental.intrinsic_a();

  int i = omp_get_thread_num();
  i32* val = alloca i32
  compute_something_into_val(val, i);
  a[i] = val;

  llvm.experimental.intrinsic_b(tok);
}

However, LLVM is free to hoist the alloca to the entry block:

void main() {
  i32* val = alloca i32
  tok = llvm.experimental.intrinsic_a();

  int i = omp_get_thread_num();
  compute_something_into_val(val, i);
  a[i] = val;

  llvm.experimental.intrinsic_b(tok);
}

and now you have a race between the four parallel forks.

The problem here is that nothing in the IR expresses that we have four copies of
the region running "at the same time".  In fact, such a control flow
is alien to LLVM today.

For instance, another evil optimization may turn:

void main() {
  int a[4];
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val = compute_something_into_val(i);
    a[i] = val;
  }

  return a[0] + a[1];
}

to

void main() {
  int a[4];
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val = compute_something_into_val(i);
    a[i] = val;
  }

  return undef;
}

since a[i] = val could have initialized at most one element in a.

Now you could say that the llvm.experimental.intrinsic_a and
llvm.experimental.intrinsic_b intrinsics are magic, and even such
"obvious" optimizations are not allowed to happen across them; but
then calls to these intrinsics is pretty fundamentally different from
"normal" calls, and are probably best modeled as new instructions.
You're going to have to do the same kind of auditing of passes either way,
and the only extra cost of a new instruction is the extra bitcode reading /
writing code.

I hope I made sense.

-- Sanjoy
> Xinmin proposed, and then figure out what needs to be first class 
> instructions.
>
> --Vikram Adve
>
>
>
>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy at
playingwithpointers.com> wrote:
>>
>> Hi,
>>
>> My bias is to use both (b) and (d), since they have complementary 
>> strengths.  We should use (b) for expressing concepts that can't be
>> semantically modeled as a call or invoke (this branch takes both its 
>> successors), and (d) for expressing things that can be (this call may 
>> never return), and annotation like things (this region (denoted by 
>> def-use of a token) is a reduction).
>>
>> I don't grok OpenMP, but perhaps we can come with one or two 
>> "generalized control flow"-type instructions that can be used
to
>> model the non-call/invoke like semantics we'd like LLVM to know 
>> about, and model the rest with (d)?
>>
>> -- Sanjoy
>>
>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev 
>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>>>
>>>
>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
>>>
>>>
>>>
>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini <mehdi.amini at
apple.com> wrote:
>>>>
>>>>
>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
>>>>
>>>>
>>>>
>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev 
>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>>
>>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via
llvm-dev
>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>
>>>>>> Hi Johannes,
>>>>>>
>>>>>>> I am especially curious where you get your data
from. Tapir [0]
>>>>>>> (and to some degree PIR [1]) have shown that, 
>>>>>>> counterintuitively, only a few changes to LLVM
passes are
>>>>>>> needed. Tapir was recently used in an MIT class
with a lot of
>>>>>>> students and it seemed to work well with only
minimal changes to
>>>>>>> analysis and especially transformation passes.
>>>>>>
>>>>>> TAPIR is an elegant, small extension and, in
particular, I think
>>>>>> the idea of asymmetric parallel tasks and control flow
is a
>>>>>> clever way to express parallelism with serial
semantics, as in
>>>>>> Cilk.  Encoding the control flow extensions as explicit
>>>>>> instructions is orthogonal to that, though arguably
more elegant than using region tags + metadata.
>>>>>>
>>>>>> However, Cilk is a tiny language compared with the full
>>>>>> complexity of other languages, like OpenMP.  To take
just one
>>>>>> example, TAPIR cannot express the ORDERED construct of
OpenMP.  A
>>>>>> more serious concern, IMO, is that TAPIR (like Cilk)
requires
>>>>>> serial semantics, whereas there are many parallel
languages, OpenMP included, that do not obey that restriction.
>>>>>> Third, OpenMP has *numerous* clauses, e.g., REDUCTION
or PRIVATE,
>>>>>> that are needed because without that, you’d be
dependent on
>>>>>> fundamentally hard compiler analyses to extract the
same
>>>>>> information for satisfactory parallel performance;
realistic
>>>>>> applications cannot depend on the success of such
analyses.
>>>>>
>>>>> I agree with this, but I’m also wondering if it needs to be
first
>>>>> class in the IR?
>>>>> For example we know our alias analysis is very basic, and
C/C++
>>>>> have a higher constraint thanks to their type system, but
we
>>>>> didn’t inject this higher level information that helps the 
>>>>> optimizer as first class IR constructs.
>>>>
>>>>
>>>> FWIW, while i agree with the general point, i wouldn't use
this example.
>>>> Because we pretty much still suffer to this day because of it
(both
>>>> in AA, and devirt, and ...)  :) We can't always even tell
fields
>>>> apart
>>>>
>>>>
>>>> Is it inherent to the infrastructure, i.e. using metadata
instead
>>>> of first class IR construct or is it just a “quality of
implementation” issue?
>>>
>>> Not to derail this conversation:
>>>
>>> IMHO, At some point there is no real difference :)
>>>
>>> Because otherwise, everything is a QOI issue.
>>>
>>> IE if it's super tricky to get metadata that works well and
works
>>> right, doesn't get lost, etc, and that's inherent to using
metadata,
>>> that to me is not a QOI issue.
>>>
>>> So could it be done with metadata? Probably?
>>> But at the same time,  if it had been done with more first class 
>>> constructs, it would have happened years ago  and been much lower
cost.
>>>
>>>
>>> This is what I meant by “inherent to the infrastructure”, thanks
for
>>> clarifying.
>>>
>>>
>>> To clarify, we were proposing metadata that is used as arguments to
>>> the region-annotation intrinsics. This metadata has the nice 
>>> property that it does not get dropped (so it is just being used as
a
>>> way of encoding whatever data structures are necessary without
predefining a syntactic schema).
>>>
>>> -Hal
>>>
>>>
>>> —
>>> Mehdi
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages Leadership 
>>> Computing Facility Argonne National Laboratory
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Tian, Xinmin via llvm-dev

2017-Jan-20 18:45 UTC

head link

[llvm-dev] [RFC] IR-level Region Annotations

In the case,   "val" is shared per OpenMP language rule. There is no
privatization needed.    %val is on the stack of master,  to share %val among
all threads,  &val is passed to the outlined function.

void main() {
  int val;
  #pragma omp parallel num_threads(4)
  {
    // Really bad naming, won't pass code review. :)
    compute_something_into_val(&val, omp_get_thread_num());
  }
}

The IR would be. 

{ void main() {
  i32* val = alloca i32
   tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.SHARED"(i32* %val), "QUAL.NUM_THREADS"(i32 4)
      %1 = omp_get_thread_num();
      compute_something_into_val(%val, %1);
   llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
}

Xinmin

-----Original Message-----
From: Sanjoy Das [mailto:sanjoy at playingwithpointers.com]
Sent: Thursday, January 19, 2017 11:40 PM
To: Tian, Xinmin <xinmin.tian at intel.com>
Cc: Adve, Vikram Sadanand <vadve at illinois.edu>; llvm-dev-request at
lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

On Thu, Jan 19, 2017 at 11:27 PM, Sanjoy Das <sanjoy at
playingwithpointers.com> wrote:> Hi Xinmin,
>
> On Thu, Jan 19, 2017 at 11:20 PM, Tian, Xinmin <xinmin.tian at
intel.com> wrote:
>> Sanjoy, the IR would be like something below. It is ok to hoist alloca
instruction outside the region. There are some small changes in optimizer to
understand region-annotation intrinsic.
>>
>> { void main() {
>>   i32* val = alloca i32
>>   tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>> "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32
4)]
>>
>>   int i = omp_get_thread_num();
>>   compute_something_into_val(val, i);
>>   a[i] = val;
>>
>>   llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>> }
>>
>> With above representation, we can do privatization and outlining as 
>> below
>>
>> { void main() {
>>   i32* val = alloca i32
>>   i32* I = alloca 32
>>   tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>> "QUAL.PRIVATE"(i32* %val, i32 %i),
"QUAL.NUM_THREADS"(i32 4)]
>>
>>   %ii = omp_get_thread_num();
>>   compute_something_into_val(%val, %i);
>>   a[i] = %val;
>>
>>   llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>> }
>>
>> 1. create i32* priv_val = alloca i32  %priv_i = ...in the region, and
replace all  %val with %prv_val in the region.
>> 2. perform outlining.
>
> But then what if compute_something_into_val is
>
> void compute_something_into_val(i32* ptr, i32 idx) {
>   static i32* cookie = null;
>   lock_mutex();
>   if (cookie == null)
>     cookie = ptr
>   else
>     assert(cookie == ptr);
>   unlock_mutex();
>   // don't write to ptr, so there is no race }
>
> In other words, how do you differentiate between the hoisted-alloca 
> situation arising due to a hoist vs. arising because that's what the 
> programmer intended (and you're required to pass in the same address 
> to each call into compute_something_into_val)?
Just to be a 100% clear, the source program in the latter case would have to be:

void main() {
  int val;
  #pragma omp parallel num_threads(4)
  {
    // Really bad naming, won't pass code review. :)
    compute_something_into_val(&val, omp_get_thread_num());
  }
}

-- Sanjoy

Mehdi Amini via llvm-dev

2017-Jan-20 18:54 UTC

head link

[llvm-dev] [RFC] IR-level Region Annotations

> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> Sanjoy, the IR would be like something below. It is ok to hoist alloca
instruction outside the region. There are some small changes in optimizer to
understand region-annotation intrinsic.
> 
> { void main() {
>  i32* val = alloca i32
>  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]
> 
>  int i = omp_get_thread_num();
>  compute_something_into_val(val, i);
>  a[i] = val;
> 
>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
> 
> With above representation, we can do privatization and outlining as below
> 
> { void main() {
>  i32* val = alloca i32
>  i32* I = alloca 32
>  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32
4)]
> 
>  %ii = omp_get_thread_num();
>  compute_something_into_val(%val, %i);
>  a[i] = %val;
> 
>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
> 
Here we come to the interesting part: the hoisting of  "i32* I = alloca 32”
above the intrinsics required to update the intrinsics information
“QUAL.PRIVATE”.
This means that the optimizer has to be aware of it, I’m missing the magic here?
I understand that an openmp specific optimization can do it, the question is how
it an openmp agnostic supposed to behave in face of
llvm.experimental.intrinsic_a?

— 
Mehdi


> 1. create i32* priv_val = alloca i32  %priv_i = ...in the region, and
replace all  %val with %prv_val in the region.
> 2. perform outlining. 
> 
> Caller code
> ....
> omp_push_num_threads(4)
> omp_fork_call( .... outline_par_region....) ....
> 
> Callee code: 
> Outlined_par_rgion {
>   I32* priv_val = alloca 32
>   I32* priv_i = ....
> 
>    Ret
> }
> 
> For OpenMP, we do support it at -O0, -O1, -O2 and -O3.  We had to make sure
it runs correctly w/ and w/o optimizations and advanced analysis. So we need to
preserve all source information for BE.
> You can take a look our LLVM-HPC paper for a bit some details.  There are
still tons of work to be done. Thanks.
> 
> Xinmin 
> 
> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
Sanjoy Das via llvm-dev
> Sent: Thursday, January 19, 2017 10:13 PM
> To: Adve, Vikram Sadanand <vadve at illinois.edu>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>; llvm-dev-request at
lists.llvm.org
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
> 
> Hi Vikram,
> 
> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand <vadve at
illinois.edu> wrote:
>> Hi Sanjoy,
>> 
>> Yes, that's exactly what we have been looking at recently here, but
>> the region tags seem to make it possible to express the control flow 
>> as well, so I think we could start with reg ions+metadata, as Hal and
> 
> I'm not yet convinced that region tags are sufficient to model exotic
control flow.
> 
> (I don't know OpenMP so this is a copy-pasted-edited example)
> 
> Say we have:
> 
> void main() {
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val;
>    compute_something_into_val(&val, i);
>    a[i] = val;
>  }
> }
> 
> I presume the (eventual) intended lowering is something like this (if the
intended lowering is different than this, and avoids the issue I'm trying to
highlight then my point is moot):
> 
> void main() {
>  tok = llvm.experimental.intrinsic_a();
> 
>  int i = omp_get_thread_num();
>  i32* val = alloca i32
>  compute_something_into_val(val, i);
>  a[i] = val;
> 
>  llvm.experimental.intrinsic_b(tok);
> }
> 
> However, LLVM is free to hoist the alloca to the entry block:
> 
> void main() {
>  i32* val = alloca i32
>  tok = llvm.experimental.intrinsic_a();
> 
>  int i = omp_get_thread_num();
>  compute_something_into_val(val, i);
>  a[i] = val;
> 
>  llvm.experimental.intrinsic_b(tok);
> }
> 
> and now you have a race between the four parallel forks.
> 
> The problem here is that nothing in the IR expresses that we have four
copies of the region running "at the same time".  In fact, such a
control flow is alien to LLVM today.
> 
> For instance, another evil optimization may turn:
> 
> void main() {
>  int a[4];
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val = compute_something_into_val(i);
>    a[i] = val;
>  }
> 
>  return a[0] + a[1];
> }
> 
> to
> 
> void main() {
>  int a[4];
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val = compute_something_into_val(i);
>    a[i] = val;
>  }
> 
>  return undef;
> }
> 
> since a[i] = val could have initialized at most one element in a.
> 
> Now you could say that the llvm.experimental.intrinsic_a and
llvm.experimental.intrinsic_b intrinsics are magic, and even such
"obvious" optimizations are not allowed to happen across them; but
then calls to these intrinsics is pretty fundamentally different from
"normal" calls, and are probably best modeled as new instructions.
> You're going to have to do the same kind of auditing of passes either
way, and the only extra cost of a new instruction is the extra bitcode reading /
writing code.
> 
> I hope I made sense.
> 
> -- Sanjoy
> 
>> Xinmin proposed, and then figure out what needs to be first class 
>> instructions.
> 
>> 
>> --Vikram Adve
>> 
>> 
>> 
>>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy at
playingwithpointers.com> wrote:
>>> 
>>> Hi,
>>> 
>>> My bias is to use both (b) and (d), since they have complementary 
>>> strengths.  We should use (b) for expressing concepts that
can't be
>>> semantically modeled as a call or invoke (this branch takes both
its
>>> successors), and (d) for expressing things that can be (this call
may
>>> never return), and annotation like things (this region (denoted by 
>>> def-use of a token) is a reduction).
>>> 
>>> I don't grok OpenMP, but perhaps we can come with one or two 
>>> "generalized control flow"-type instructions that can be
used to
>>> model the non-call/invoke like semantics we'd like LLVM to know
>>> about, and model the rest with (d)?
>>> 
>>> -- Sanjoy
>>> 
>>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev 
>>> <llvm-dev at lists.llvm.org> wrote:
>>>> 
>>>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>>>> 
>>>> 
>>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini
<mehdi.amini at apple.com> wrote:
>>>>> 
>>>>> 
>>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev 
>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand
via llvm-dev
>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>> 
>>>>>>> Hi Johannes,
>>>>>>> 
>>>>>>>> I am especially curious where you get your data
from. Tapir [0]
>>>>>>>> (and to some degree PIR [1]) have shown that, 
>>>>>>>> counterintuitively, only a few changes to LLVM
passes are
>>>>>>>> needed. Tapir was recently used in an MIT class
with a lot of
>>>>>>>> students and it seemed to work well with only
minimal changes to
>>>>>>>> analysis and especially transformation passes.
>>>>>>> 
>>>>>>> TAPIR is an elegant, small extension and, in
particular, I think
>>>>>>> the idea of asymmetric parallel tasks and control
flow is a
>>>>>>> clever way to express parallelism with serial
semantics, as in
>>>>>>> Cilk.  Encoding the control flow extensions as
explicit
>>>>>>> instructions is orthogonal to that, though arguably
more elegant than using region tags + metadata.
>>>>>>> 
>>>>>>> However, Cilk is a tiny language compared with the
full
>>>>>>> complexity of other languages, like OpenMP.  To
take just one
>>>>>>> example, TAPIR cannot express the ORDERED construct
of OpenMP.  A
>>>>>>> more serious concern, IMO, is that TAPIR (like
Cilk) requires
>>>>>>> serial semantics, whereas there are many parallel
languages, OpenMP included, that do not obey that restriction.
>>>>>>> Third, OpenMP has *numerous* clauses, e.g.,
REDUCTION or PRIVATE,
>>>>>>> that are needed because without that, you’d be
dependent on
>>>>>>> fundamentally hard compiler analyses to extract the
same
>>>>>>> information for satisfactory parallel performance;
realistic
>>>>>>> applications cannot depend on the success of such
analyses.
>>>>>> 
>>>>>> I agree with this, but I’m also wondering if it needs
to be first
>>>>>> class in the IR?
>>>>>> For example we know our alias analysis is very basic,
and C/C++
>>>>>> have a higher constraint thanks to their type system,
but we
>>>>>> didn’t inject this higher level information that helps
the
>>>>>> optimizer as first class IR constructs.
>>>>> 
>>>>> 
>>>>> FWIW, while i agree with the general point, i wouldn't
use this example.
>>>>> Because we pretty much still suffer to this day because of
it (both
>>>>> in AA, and devirt, and ...)  :) We can't always even
tell fields
>>>>> apart
>>>>> 
>>>>> 
>>>>> Is it inherent to the infrastructure, i.e. using metadata
instead
>>>>> of first class IR construct or is it just a “quality of
implementation” issue?
>>>> 
>>>> Not to derail this conversation:
>>>> 
>>>> IMHO, At some point there is no real difference :)
>>>> 
>>>> Because otherwise, everything is a QOI issue.
>>>> 
>>>> IE if it's super tricky to get metadata that works well and
works
>>>> right, doesn't get lost, etc, and that's inherent to
using metadata,
>>>> that to me is not a QOI issue.
>>>> 
>>>> So could it be done with metadata? Probably?
>>>> But at the same time,  if it had been done with more first
class
>>>> constructs, it would have happened years ago  and been much
lower cost.
>>>> 
>>>> 
>>>> This is what I meant by “inherent to the infrastructure”,
thanks for
>>>> clarifying.
>>>> 
>>>> 
>>>> To clarify, we were proposing metadata that is used as
arguments to
>>>> the region-annotation intrinsics. This metadata has the nice 
>>>> property that it does not get dropped (so it is just being used
as a
>>>> way of encoding whatever data structures are necessary without
predefining a syntactic schema).
>>>> 
>>>> -Hal
>>>> 
>>>> 
>>>> —
>>>> Mehdi
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>> 
>>>> 
>>>> --
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages Leadership 
>>>> Computing Facility Argonne National Laboratory
>>>> 
>>>> 
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Yonghong Yan via llvm-dev

2017-Jan-20 19:21 UTC

head link

[llvm-dev] [RFC] IR-level Region Annotations

Xinmin,

outlining turns a parallel program into a sequential one from compiler's
perspective, and that is why most of the parallel-ignorant pass would hurt.
In your IR description for Sanjoy's example, does the current approach of
outlining impacting the way of the IR should be enhanced for parallelism?

For that specific example (or other analysis and optimization SPMD) and
what is implemented in clang, I am not sure whether we are going to change
the frontend so not to outline the parallel region, or allow to perform
certain optimization such as hoisting that alloca in clang which is not
desired I believe. Or annotate the outlined function together with the
intrinsic_a so that hoisting can be performed, in which case the
instrisic_a would like this:

tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)
plus info for
OUTLINED_call.

Mehdi,

I think i am asking the same question as you asked me.

Yonghong


>
> On Fri, Jan 20, 2017 at 1:54 PM, Mehdi Amini via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>>
>> > On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>> >
>> > Sanjoy, the IR would be like something below. It is ok to hoist
alloca
>> instruction outside the region. There are some small changes in
optimizer
>> to understand region-annotation intrinsic.
>> >
>> > { void main() {
>> >  i32* val = alloca i32
>> >  tok = llvm.experimental.intrinsic_a()[
"DIR.PARALLEL"(),
>> "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32
4)]
>> >
>> >  int i = omp_get_thread_num();
>> >  compute_something_into_val(val, i);
>> >  a[i] = val;
>> >
>> > 
llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>> > }
>> >
>> > With above representation, we can do privatization and outlining
as
>> below
>> >
>> > { void main() {
>> >  i32* val = alloca i32
>> >  i32* I = alloca 32
>> >  tok = llvm.experimental.intrinsic_a()[
"DIR.PARALLEL"(),
>> "QUAL.PRIVATE"(i32* %val, i32 %i),
"QUAL.NUM_THREADS"(i32 4)]
>> >
>> >  %ii = omp_get_thread_num();
>> >  compute_something_into_val(%val, %i);
>> >  a[i] = %val;
>> >
>> > 
llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>> > }
>> >
>>
>> Here we come to the interesting part: the hoisting of  "i32* I =
alloca
>> 32” above the intrinsics required to update the intrinsics information
>> “QUAL.PRIVATE”.
>> This means that the optimizer has to be aware of it, I’m missing the
>> magic here?
>> I understand that an openmp specific optimization can do it, the
question
>> is how it an openmp agnostic supposed to behave in face of
>> llvm.experimental.intrinsic_a?
>>
>> —
>> Mehdi
>>
>>
>>
>> > 1. create i32* priv_val = alloca i32  %priv_i = ...in the region,
and
>> replace all  %val with %prv_val in the region.
>> > 2. perform outlining.
>> >
>> > Caller code
>> > ....
>> > omp_push_num_threads(4)
>> > omp_fork_call( .... outline_par_region....) ....
>> >
>> > Callee code:
>> > Outlined_par_rgion {
>> >   I32* priv_val = alloca 32
>> >   I32* priv_i = ....
>> >
>> >    Ret
>> > }
>> >
>> > For OpenMP, we do support it at -O0, -O1, -O2 and -O3.  We had to
make
>> sure it runs correctly w/ and w/o optimizations and advanced analysis.
So
>> we need to preserve all source information for BE.
>> > You can take a look our LLVM-HPC paper for a bit some details. 
There
>> are still tons of work to be done. Thanks.
>> >
>> > Xinmin
>> >
>> > -----Original Message-----
>> > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On
Behalf Of
>> Sanjoy Das via llvm-dev
>> > Sent: Thursday, January 19, 2017 10:13 PM
>> > To: Adve, Vikram Sadanand <vadve at illinois.edu>
>> > Cc: llvm-dev <llvm-dev at lists.llvm.org>; llvm-dev-request
at lists.llvm.org
>> > Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>> >
>> > Hi Vikram,
>> >
>> > On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand <
>> vadve at illinois.edu> wrote:
>> >> Hi Sanjoy,
>> >>
>> >> Yes, that's exactly what we have been looking at recently
here, but
>> >> the region tags seem to make it possible to express the
control flow
>> >> as well, so I think we could start with reg ions+metadata, as
Hal and
>> >
>> > I'm not yet convinced that region tags are sufficient to model
exotic
>> control flow.
>> >
>> > (I don't know OpenMP so this is a copy-pasted-edited example)
>> >
>> > Say we have:
>> >
>> > void main() {
>> >  #pragma omp parallel num_threads(4)
>> >  {
>> >    int i = omp_get_thread_num();
>> >    int val;
>> >    compute_something_into_val(&val, i);
>> >    a[i] = val;
>> >  }
>> > }
>> >
>> > I presume the (eventual) intended lowering is something like this
(if
>> the intended lowering is different than this, and avoids the issue
I'm
>> trying to highlight then my point is moot):
>> >
>> > void main() {
>> >  tok = llvm.experimental.intrinsic_a();
>> >
>> >  int i = omp_get_thread_num();
>> >  i32* val = alloca i32
>> >  compute_something_into_val(val, i);
>> >  a[i] = val;
>> >
>> >  llvm.experimental.intrinsic_b(tok);
>> > }
>> >
>> > However, LLVM is free to hoist the alloca to the entry block:
>> >
>> > void main() {
>> >  i32* val = alloca i32
>> >  tok = llvm.experimental.intrinsic_a();
>> >
>> >  int i = omp_get_thread_num();
>> >  compute_something_into_val(val, i);
>> >  a[i] = val;
>> >
>> >  llvm.experimental.intrinsic_b(tok);
>> > }
>> >
>> > and now you have a race between the four parallel forks.
>> >
>> > The problem here is that nothing in the IR expresses that we have
four
>> copies of the region running "at the same time".  In fact,
such a control
>> flow is alien to LLVM today.
>> >
>> > For instance, another evil optimization may turn:
>> >
>> > void main() {
>> >  int a[4];
>> >  #pragma omp parallel num_threads(4)
>> >  {
>> >    int i = omp_get_thread_num();
>> >    int val = compute_something_into_val(i);
>> >    a[i] = val;
>> >  }
>> >
>> >  return a[0] + a[1];
>> > }
>> >
>> > to
>> >
>> > void main() {
>> >  int a[4];
>> >  #pragma omp parallel num_threads(4)
>> >  {
>> >    int i = omp_get_thread_num();
>> >    int val = compute_something_into_val(i);
>> >    a[i] = val;
>> >  }
>> >
>> >  return undef;
>> > }
>> >
>> > since a[i] = val could have initialized at most one element in a.
>> >
>> > Now you could say that the llvm.experimental.intrinsic_a and
>> llvm.experimental.intrinsic_b intrinsics are magic, and even such
"obvious"
>> optimizations are not allowed to happen across them; but then calls to
>> these intrinsics is pretty fundamentally different from
"normal" calls, and
>> are probably best modeled as new instructions.
>> > You're going to have to do the same kind of auditing of passes
either
>> way, and the only extra cost of a new instruction is the extra bitcode
>> reading / writing code.
>> >
>> > I hope I made sense.
>> >
>> > -- Sanjoy
>> >
>> >> Xinmin proposed, and then figure out what needs to be first
class
>> >> instructions.
>> >
>> >>
>> >> --Vikram Adve
>> >>
>> >>
>> >>
>> >>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das <
>> sanjoy at playingwithpointers.com> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> My bias is to use both (b) and (d), since they have
complementary
>> >>> strengths.  We should use (b) for expressing concepts that
can't be
>> >>> semantically modeled as a call or invoke (this branch
takes both its
>> >>> successors), and (d) for expressing things that can be
(this call may
>> >>> never return), and annotation like things (this region
(denoted by
>> >>> def-use of a token) is a reduction).
>> >>>
>> >>> I don't grok OpenMP, but perhaps we can come with one
or two
>> >>> "generalized control flow"-type instructions
that can be used to
>> >>> model the non-call/invoke like semantics we'd like
LLVM to know
>> >>> about, and model the rest with (d)?
>> >>>
>> >>> -- Sanjoy
>> >>>
>> >>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev
>> >>> <llvm-dev at lists.llvm.org> wrote:
>> >>>>
>> >>>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev
wrote:
>> >>>>
>> >>>>
>> >>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin
at dberlin.org>
>> wrote:
>> >>>>
>> >>>>
>> >>>>
>> >>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini
<mehdi.amini at apple.com>
>> wrote:
>> >>>>>
>> >>>>>
>> >>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin
<dberlin at dberlin.org>
>> wrote:
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via
llvm-dev
>> >>>>> <llvm-dev at lists.llvm.org> wrote:
>> >>>>>>
>> >>>>>>
>> >>>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram
Sadanand via llvm-dev
>> >>>>>>> <llvm-dev at lists.llvm.org> wrote:
>> >>>>>>>
>> >>>>>>> Hi Johannes,
>> >>>>>>>
>> >>>>>>>> I am especially curious where you get
your data from. Tapir [0]
>> >>>>>>>> (and to some degree PIR [1]) have
shown that,
>> >>>>>>>> counterintuitively, only a few changes
to LLVM passes are
>> >>>>>>>> needed. Tapir was recently used in an
MIT class with a lot of
>> >>>>>>>> students and it seemed to work well
with only minimal changes to
>> >>>>>>>> analysis and especially transformation
passes.
>> >>>>>>>
>> >>>>>>> TAPIR is an elegant, small extension and,
in particular, I think
>> >>>>>>> the idea of asymmetric parallel tasks and
control flow is a
>> >>>>>>> clever way to express parallelism with
serial semantics, as in
>> >>>>>>> Cilk.  Encoding the control flow
extensions as explicit
>> >>>>>>> instructions is orthogonal to that, though
arguably more elegant
>> than using region tags + metadata.
>> >>>>>>>
>> >>>>>>> However, Cilk is a tiny language compared
with the full
>> >>>>>>> complexity of other languages, like
OpenMP.  To take just one
>> >>>>>>> example, TAPIR cannot express the ORDERED
construct of OpenMP.  A
>> >>>>>>> more serious concern, IMO, is that TAPIR
(like Cilk) requires
>> >>>>>>> serial semantics, whereas there are many
parallel languages,
>> OpenMP included, that do not obey that restriction.
>> >>>>>>> Third, OpenMP has *numerous* clauses,
e.g., REDUCTION or PRIVATE,
>> >>>>>>> that are needed because without that,
you’d be dependent on
>> >>>>>>> fundamentally hard compiler analyses to
extract the same
>> >>>>>>> information for satisfactory parallel
performance; realistic
>> >>>>>>> applications cannot depend on the success
of such analyses.
>> >>>>>>
>> >>>>>> I agree with this, but I’m also wondering if
it needs to be first
>> >>>>>> class in the IR?
>> >>>>>> For example we know our alias analysis is very
basic, and C/C++
>> >>>>>> have a higher constraint thanks to their type
system, but we
>> >>>>>> didn’t inject this higher level information
that helps the
>> >>>>>> optimizer as first class IR constructs.
>> >>>>>
>> >>>>>
>> >>>>> FWIW, while i agree with the general point, i
wouldn't use this
>> example.
>> >>>>> Because we pretty much still suffer to this day
because of it (both
>> >>>>> in AA, and devirt, and ...)  :) We can't
always even tell fields
>> >>>>> apart
>> >>>>>
>> >>>>>
>> >>>>> Is it inherent to the infrastructure, i.e. using
metadata instead
>> >>>>> of first class IR construct or is it just a
“quality of
>> implementation” issue?
>> >>>>
>> >>>> Not to derail this conversation:
>> >>>>
>> >>>> IMHO, At some point there is no real difference :)
>> >>>>
>> >>>> Because otherwise, everything is a QOI issue.
>> >>>>
>> >>>> IE if it's super tricky to get metadata that works
well and works
>> >>>> right, doesn't get lost, etc, and that's
inherent to using metadata,
>> >>>> that to me is not a QOI issue.
>> >>>>
>> >>>> So could it be done with metadata? Probably?
>> >>>> But at the same time,  if it had been done with more
first class
>> >>>> constructs, it would have happened years ago  and been
much lower
>> cost.
>> >>>>
>> >>>>
>> >>>> This is what I meant by “inherent to the
infrastructure”, thanks for
>> >>>> clarifying.
>> >>>>
>> >>>>
>> >>>> To clarify, we were proposing metadata that is used as
arguments to
>> >>>> the region-annotation intrinsics. This metadata has
the nice
>> >>>> property that it does not get dropped (so it is just
being used as a
>> >>>> way of encoding whatever data structures are necessary
without
>> predefining a syntactic schema).
>> >>>>
>> >>>> -Hal
>> >>>>
>> >>>>
>> >>>> —
>> >>>> Mehdi
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> _______________________________________________
>> >>>> LLVM Developers mailing list
>> >>>> llvm-dev at lists.llvm.org
>> >>>>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Hal Finkel
>> >>>> Lead, Compiler Technology and Programming Languages
Leadership
>> >>>> Computing Facility Argonne National Laboratory
>> >>>>
>> >>>>
>> >>>> _______________________________________________
>> >>>> LLVM Developers mailing list
>> >>>> llvm-dev at lists.llvm.org
>> >>>>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> >>>>
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > llvm-dev at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > llvm-dev at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170120/c37be328/attachment.html>

Tian, Xinmin via llvm-dev

2017-Jan-20 19:22 UTC

head link

[llvm-dev] [RFC] IR-level Region Annotations

Yonghong, In our implementation (not open sourced), we don’t do outlining the
Front-End.  See my previous reply to Medhi’s email.

Xinmin

From: Yonghong Yan [mailto:yan at oakland.edu]
Sent: Friday, January 20, 2017 11:18 AM
To: Mehdi Amini
Cc: Tian, Xinmin; llvm-dev at lists.llvm.org; llvm-dev-request at
lists.llvm.org; Adve, Vikram Sadanand
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

Xinmin,

outlining turns a parallel program into a sequential one from compiler's
perspective, and that is why most of the parallel-ignorant pass would hurt. In
your IR description for Sanjoy's example, does the current approach of
outlining impacting the way of the IR should be enhanced for parallelism?

For that specific example (or other analysis and optimization SPMD) and what is
implemented in clang, I am not sure whether we are going to change the frontend
so not to outline the parallel region, or allow to perform certain optimization
such as hoisting that alloca in clang which is not desired I believe. Or
annotate the outlined function together with the intrinsic_a so that hoisting
can be performed, in which case the instrisic_a would like this:

tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)
plus info for OUTLINED_call.

Mehdi,

I think i am asking the same question as you asked me.

Yonghong



On Fri, Jan 20, 2017 at 1:54 PM, Mehdi Amini via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
>
> Sanjoy, the IR would be like something below. It is ok to hoist alloca
instruction outside the region. There are some small changes in optimizer to
understand region-annotation intrinsic.
>
> { void main() {
>  i32* val = alloca i32
>  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]
>
>  int i = omp_get_thread_num();
>  compute_something_into_val(val, i);
>  a[i] = val;
>
>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
>
> With above representation, we can do privatization and outlining as below
>
> { void main() {
>  i32* val = alloca i32
>  i32* I = alloca 32
>  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32
4)]
>
>  %ii = omp_get_thread_num();
>  compute_something_into_val(%val, %i);
>  a[i] = %val;
>
>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
>
Here we come to the interesting part: the hoisting of  "i32* I = alloca 32”
above the intrinsics required to update the intrinsics information
“QUAL.PRIVATE”.
This means that the optimizer has to be aware of it, I’m missing the magic here?
I understand that an openmp specific optimization can do it, the question is how
it an openmp agnostic supposed to behave in face of
llvm.experimental.intrinsic_a?

—
Mehdi


> 1. create i32* priv_val = alloca i32  %priv_i = ...in the region, and
replace all  %val with %prv_val in the region.
> 2. perform outlining.
>
> Caller code
> ....
> omp_push_num_threads(4)
> omp_fork_call( .... outline_par_region....) ....
>
> Callee code:
> Outlined_par_rgion {
>   I32* priv_val = alloca 32
>   I32* priv_i = ....
>
>    Ret
> }
>
> For OpenMP, we do support it at -O0, -O1, -O2 and -O3.  We had to make sure
it runs correctly w/ and w/o optimizations and advanced analysis. So we need to
preserve all source information for BE.
> You can take a look our LLVM-HPC paper for a bit some details.  There are
still tons of work to be done. Thanks.
>
> Xinmin
>
> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at
lists.llvm.org<mailto:llvm-dev-bounces at lists.llvm.org>] On Behalf Of
Sanjoy Das via llvm-dev
> Sent: Thursday, January 19, 2017 10:13 PM
> To: Adve, Vikram Sadanand <vadve at illinois.edu<mailto:vadve at
illinois.edu>>
> Cc: llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>>; llvm-dev-request at
lists.llvm.org<mailto:llvm-dev-request at lists.llvm.org>
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>
> Hi Vikram,
>
> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand <vadve at
illinois.edu<mailto:vadve at illinois.edu>> wrote:
>> Hi Sanjoy,
>>
>> Yes, that's exactly what we have been looking at recently here, but
>> the region tags seem to make it possible to express the control flow
>> as well, so I think we could start with reg ions+metadata, as Hal and
>
> I'm not yet convinced that region tags are sufficient to model exotic
control flow.
>
> (I don't know OpenMP so this is a copy-pasted-edited example)
>
> Say we have:
>
> void main() {
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val;
>    compute_something_into_val(&val, i);
>    a[i] = val;
>  }
> }
>
> I presume the (eventual) intended lowering is something like this (if the
intended lowering is different than this, and avoids the issue I'm trying to
highlight then my point is moot):
>
> void main() {
>  tok = llvm.experimental.intrinsic_a();
>
>  int i = omp_get_thread_num();
>  i32* val = alloca i32
>  compute_something_into_val(val, i);
>  a[i] = val;
>
>  llvm.experimental.intrinsic_b(tok);
> }
>
> However, LLVM is free to hoist the alloca to the entry block:
>
> void main() {
>  i32* val = alloca i32
>  tok = llvm.experimental.intrinsic_a();
>
>  int i = omp_get_thread_num();
>  compute_something_into_val(val, i);
>  a[i] = val;
>
>  llvm.experimental.intrinsic_b(tok);
> }
>
> and now you have a race between the four parallel forks.
>
> The problem here is that nothing in the IR expresses that we have four
copies of the region running "at the same time".  In fact, such a
control flow is alien to LLVM today.
>
> For instance, another evil optimization may turn:
>
> void main() {
>  int a[4];
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val = compute_something_into_val(i);
>    a[i] = val;
>  }
>
>  return a[0] + a[1];
> }
>
> to
>
> void main() {
>  int a[4];
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val = compute_something_into_val(i);
>    a[i] = val;
>  }
>
>  return undef;
> }
>
> since a[i] = val could have initialized at most one element in a.
>
> Now you could say that the llvm.experimental.intrinsic_a and
llvm.experimental.intrinsic_b intrinsics are magic, and even such
"obvious" optimizations are not allowed to happen across them; but
then calls to these intrinsics is pretty fundamentally different from
"normal" calls, and are probably best modeled as new instructions.
> You're going to have to do the same kind of auditing of passes either
way, and the only extra cost of a new instruction is the extra bitcode reading /
writing code.
>
> I hope I made sense.
>
> -- Sanjoy
>
>> Xinmin proposed, and then figure out what needs to be first class
>> instructions.
>
>>
>> --Vikram Adve
>>
>>
>>
>>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy at
playingwithpointers.com<mailto:sanjoy at playingwithpointers.com>>
wrote:
>>>
>>> Hi,
>>>
>>> My bias is to use both (b) and (d), since they have complementary
>>> strengths.  We should use (b) for expressing concepts that
can't be
>>> semantically modeled as a call or invoke (this branch takes both
its
>>> successors), and (d) for expressing things that can be (this call
may
>>> never return), and annotation like things (this region (denoted by
>>> def-use of a token) is a reduction).
>>>
>>> I don't grok OpenMP, but perhaps we can come with one or two
>>> "generalized control flow"-type instructions that can be
used to
>>> model the non-call/invoke like semantics we'd like LLVM to know
>>> about, and model the rest with (d)?
>>>
>>> -- Sanjoy
>>>
>>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev
>>> <llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>> wrote:
>>>>
>>>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>>>>
>>>>
>>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at
dberlin.org<mailto:dberlin at dberlin.org>> wrote:
>>>>
>>>>
>>>>
>>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini
<mehdi.amini at apple.com<mailto:mehdi.amini at apple.com>> wrote:
>>>>>
>>>>>
>>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at
dberlin.org<mailto:dberlin at dberlin.org>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev
>>>>> <llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>> wrote:
>>>>>>
>>>>>>
>>>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand
via llvm-dev
>>>>>>> <llvm-dev at lists.llvm.org<mailto:llvm-dev
at lists.llvm.org>> wrote:
>>>>>>>
>>>>>>> Hi Johannes,
>>>>>>>
>>>>>>>> I am especially curious where you get your data
from. Tapir [0]
>>>>>>>> (and to some degree PIR [1]) have shown that,
>>>>>>>> counterintuitively, only a few changes to LLVM
passes are
>>>>>>>> needed. Tapir was recently used in an MIT class
with a lot of
>>>>>>>> students and it seemed to work well with only
minimal changes to
>>>>>>>> analysis and especially transformation passes.
>>>>>>>
>>>>>>> TAPIR is an elegant, small extension and, in
particular, I think
>>>>>>> the idea of asymmetric parallel tasks and control
flow is a
>>>>>>> clever way to express parallelism with serial
semantics, as in
>>>>>>> Cilk.  Encoding the control flow extensions as
explicit
>>>>>>> instructions is orthogonal to that, though arguably
more elegant than using region tags + metadata.
>>>>>>>
>>>>>>> However, Cilk is a tiny language compared with the
full
>>>>>>> complexity of other languages, like OpenMP.  To
take just one
>>>>>>> example, TAPIR cannot express the ORDERED construct
of OpenMP.  A
>>>>>>> more serious concern, IMO, is that TAPIR (like
Cilk) requires
>>>>>>> serial semantics, whereas there are many parallel
languages, OpenMP included, that do not obey that restriction.
>>>>>>> Third, OpenMP has *numerous* clauses, e.g.,
REDUCTION or PRIVATE,
>>>>>>> that are needed because without that, you’d be
dependent on
>>>>>>> fundamentally hard compiler analyses to extract the
same
>>>>>>> information for satisfactory parallel performance;
realistic
>>>>>>> applications cannot depend on the success of such
analyses.
>>>>>>
>>>>>> I agree with this, but I’m also wondering if it needs
to be first
>>>>>> class in the IR?
>>>>>> For example we know our alias analysis is very basic,
and C/C++
>>>>>> have a higher constraint thanks to their type system,
but we
>>>>>> didn’t inject this higher level information that helps
the
>>>>>> optimizer as first class IR constructs.
>>>>>
>>>>>
>>>>> FWIW, while i agree with the general point, i wouldn't
use this example.
>>>>> Because we pretty much still suffer to this day because of
it (both
>>>>> in AA, and devirt, and ...)  :) We can't always even
tell fields
>>>>> apart
>>>>>
>>>>>
>>>>> Is it inherent to the infrastructure, i.e. using metadata
instead
>>>>> of first class IR construct or is it just a “quality of
implementation” issue?
>>>>
>>>> Not to derail this conversation:
>>>>
>>>> IMHO, At some point there is no real difference :)
>>>>
>>>> Because otherwise, everything is a QOI issue.
>>>>
>>>> IE if it's super tricky to get metadata that works well and
works
>>>> right, doesn't get lost, etc, and that's inherent to
using metadata,
>>>> that to me is not a QOI issue.
>>>>
>>>> So could it be done with metadata? Probably?
>>>> But at the same time,  if it had been done with more first
class
>>>> constructs, it would have happened years ago  and been much
lower cost.
>>>>
>>>>
>>>> This is what I meant by “inherent to the infrastructure”,
thanks for
>>>> clarifying.
>>>>
>>>>
>>>> To clarify, we were proposing metadata that is used as
arguments to
>>>> the region-annotation intrinsics. This metadata has the nice
>>>> property that it does not get dropped (so it is just being used
as a
>>>> way of encoding whatever data structures are necessary without
predefining a syntactic schema).
>>>>
>>>> -Hal
>>>>
>>>>
>>>> —
>>>> Mehdi
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>>
>>>> --
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages Leadership
>>>> Computing Facility Argonne National Laboratory
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170120/73295a48/attachment-0001.html>

Sanjoy Das via llvm-dev

2017-Jan-20 19:35 UTC

head link

[llvm-dev] [RFC] IR-level Region Annotations

Hi,

I'm going to club together some responses.

I agree that outlining function sub-bodies and passing in the function
pointers to said outlined bodies to OpenMP helpers lets us correctly
implement the semantics we need.  However, unless I severely
misunderstood the thread, I thought the key idea was to move *away*
from that representation and towards a representation that _allows_
optimization?

My problem with representing parallel regions with
intrinsic-denoted-regions is that we're lying to the optimizer about
what the code actually does.  Calls, even to intrinsics, can "at
worst" do some combination of the following:

 - Write to and read from arbitrary memory
 - Have UB (but we're allowed to pretend that they don't)
 - Throw an exception
 - Never return, either by infinite looping or by calling exit(0)
 - Have memory synchronization operations, like fences, atomic loads,
   stores etc.
 - Have side effects like IO, volatile writes

If an intrinsic's behavior can be explained by some subset of the
above, then you should not need to edit any pass to preserve
_correctness_ -- all optimization passes (today) conservatively assume
that calls that they don't understand have all of the behaviors
outlined above.

However, if to preserve *correctness* you have to edit optimization
passes and teach them that certain intrinsic calls have behavior
*outside* the set mentioned above then the instruction really does not
have "call semantics".  `call @llvm.experimental.region_begin()` is
really a fundamentally new instruction masquerading as an intrinsic,
and it is probably better to call a spade a spade and represent it as
a new instruction.

The setting for the examples I gave was not that "here is a case we
need to get right".  The setting was that "here is a *symptom* that
shows that we've lied to the optimizer".  We can go ahead and fix all
the symptoms by adding bailouts to the respective passes, but that
does not make us immune to passes that we don't know about
e.g. downstream passes, passes that will be added later.  It also puts
us in a weird spot around semantics of call instructions.

-- Sanjoy

On Fri, Jan 20, 2017 at 11:22 AM, Tian, Xinmin via llvm-dev
<llvm-dev at lists.llvm.org> wrote:> Yonghong, In our implementation (not open sourced), we don’t do outlining
> the Front-End.  See my previous reply to Medhi’s email.
>
>
>
> Xinmin
>
>
>
> From: Yonghong Yan [mailto:yan at oakland.edu]
> Sent: Friday, January 20, 2017 11:18 AM
> To: Mehdi Amini
> Cc: Tian, Xinmin; llvm-dev at lists.llvm.org; llvm-dev-request at
lists.llvm.org;
> Adve, Vikram Sadanand
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>
>
>
> Xinmin,
>
>
>
> outlining turns a parallel program into a sequential one from
compiler's
> perspective, and that is why most of the parallel-ignorant pass would hurt.
> In your IR description for Sanjoy's example, does the current approach
of
> outlining impacting the way of the IR should be enhanced for parallelism?
>
>
>
> For that specific example (or other analysis and optimization SPMD) and
what
> is implemented in clang, I am not sure whether we are going to change the
> frontend so not to outline the parallel region, or allow to perform certain
> optimization such as hoisting that alloca in clang which is not desired I
> believe. Or annotate the outlined function together with the intrinsic_a so
> that hoisting can be performed, in which case the instrisic_a would like
> this:
>
>
>
> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.PRIVATE"(i32*
> %val, i32 %i), "QUAL.NUM_THREADS"(i32 4) plus info for
OUTLINED_call.
>
>
>
> Mehdi,
>
>
>
> I think i am asking the same question as you asked me.
>
>
>
> Yonghong
>
>
>
>
>
>
>
> On Fri, Jan 20, 2017 at 1:54 PM, Mehdi Amini via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>
>> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>
>> Sanjoy, the IR would be like something below. It is ok to hoist alloca
>> instruction outside the region. There are some small changes in
optimizer to
>> understand region-annotation intrinsic.
>>
>> { void main() {
>>  i32* val = alloca i32
>>  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>> "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32
4)]
>>
>>  int i = omp_get_thread_num();
>>  compute_something_into_val(val, i);
>>  a[i] = val;
>>
>>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>> }
>>
>> With above representation, we can do privatization and outlining as
below
>>
>> { void main() {
>>  i32* val = alloca i32
>>  i32* I = alloca 32
>>  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>> "QUAL.PRIVATE"(i32* %val, i32 %i),
"QUAL.NUM_THREADS"(i32 4)]
>>
>>  %ii = omp_get_thread_num();
>>  compute_something_into_val(%val, %i);
>>  a[i] = %val;
>>
>>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>> }
>>
>
> Here we come to the interesting part: the hoisting of  "i32* I =
alloca 32”
> above the intrinsics required to update the intrinsics information
> “QUAL.PRIVATE”.
> This means that the optimizer has to be aware of it, I’m missing the magic
> here?
> I understand that an openmp specific optimization can do it, the question
is
> how it an openmp agnostic supposed to behave in face of
> llvm.experimental.intrinsic_a?
>
> —
> Mehdi
>
>
>
>
>> 1. create i32* priv_val = alloca i32  %priv_i = ...in the region, and
>> replace all  %val with %prv_val in the region.
>> 2. perform outlining.
>>
>> Caller code
>> ....
>> omp_push_num_threads(4)
>> omp_fork_call( .... outline_par_region....) ....
>>
>> Callee code:
>> Outlined_par_rgion {
>>   I32* priv_val = alloca 32
>>   I32* priv_i = ....
>>
>>    Ret
>> }
>>
>> For OpenMP, we do support it at -O0, -O1, -O2 and -O3.  We had to make
>> sure it runs correctly w/ and w/o optimizations and advanced analysis.
So we
>> need to preserve all source information for BE.
>> You can take a look our LLVM-HPC paper for a bit some details.  There
are
>> still tons of work to be done. Thanks.
>>
>> Xinmin
>>
>> -----Original Message-----
>> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
>> Sanjoy Das via llvm-dev
>> Sent: Thursday, January 19, 2017 10:13 PM
>> To: Adve, Vikram Sadanand <vadve at illinois.edu>
>> Cc: llvm-dev <llvm-dev at lists.llvm.org>; llvm-dev-request at
lists.llvm.org
>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>
>> Hi Vikram,
>>
>> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand
>> <vadve at illinois.edu> wrote:
>>> Hi Sanjoy,
>>>
>>> Yes, that's exactly what we have been looking at recently here,
but
>>> the region tags seem to make it possible to express the control
flow
>>> as well, so I think we could start with reg ions+metadata, as Hal
and
>>
>> I'm not yet convinced that region tags are sufficient to model
exotic
>> control flow.
>>
>> (I don't know OpenMP so this is a copy-pasted-edited example)
>>
>> Say we have:
>>
>> void main() {
>>  #pragma omp parallel num_threads(4)
>>  {
>>    int i = omp_get_thread_num();
>>    int val;
>>    compute_something_into_val(&val, i);
>>    a[i] = val;
>>  }
>> }
>>
>> I presume the (eventual) intended lowering is something like this (if
the
>> intended lowering is different than this, and avoids the issue I'm
trying to
>> highlight then my point is moot):
>>
>> void main() {
>>  tok = llvm.experimental.intrinsic_a();
>>
>>  int i = omp_get_thread_num();
>>  i32* val = alloca i32
>>  compute_something_into_val(val, i);
>>  a[i] = val;
>>
>>  llvm.experimental.intrinsic_b(tok);
>> }
>>
>> However, LLVM is free to hoist the alloca to the entry block:
>>
>> void main() {
>>  i32* val = alloca i32
>>  tok = llvm.experimental.intrinsic_a();
>>
>>  int i = omp_get_thread_num();
>>  compute_something_into_val(val, i);
>>  a[i] = val;
>>
>>  llvm.experimental.intrinsic_b(tok);
>> }
>>
>> and now you have a race between the four parallel forks.
>>
>> The problem here is that nothing in the IR expresses that we have four
>> copies of the region running "at the same time".  In fact,
such a control
>> flow is alien to LLVM today.
>>
>> For instance, another evil optimization may turn:
>>
>> void main() {
>>  int a[4];
>>  #pragma omp parallel num_threads(4)
>>  {
>>    int i = omp_get_thread_num();
>>    int val = compute_something_into_val(i);
>>    a[i] = val;
>>  }
>>
>>  return a[0] + a[1];
>> }
>>
>> to
>>
>> void main() {
>>  int a[4];
>>  #pragma omp parallel num_threads(4)
>>  {
>>    int i = omp_get_thread_num();
>>    int val = compute_something_into_val(i);
>>    a[i] = val;
>>  }
>>
>>  return undef;
>> }
>>
>> since a[i] = val could have initialized at most one element in a.
>>
>> Now you could say that the llvm.experimental.intrinsic_a and
>> llvm.experimental.intrinsic_b intrinsics are magic, and even such
"obvious"
>> optimizations are not allowed to happen across them; but then calls to
these
>> intrinsics is pretty fundamentally different from "normal"
calls, and are
>> probably best modeled as new instructions.
>> You're going to have to do the same kind of auditing of passes
either way,
>> and the only extra cost of a new instruction is the extra bitcode
reading /
>> writing code.
>>
>> I hope I made sense.
>>
>> -- Sanjoy
>>
>>> Xinmin proposed, and then figure out what needs to be first class
>>> instructions.
>>
>>>
>>> --Vikram Adve
>>>
>>>
>>>
>>>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das
>>>> <sanjoy at playingwithpointers.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> My bias is to use both (b) and (d), since they have
complementary
>>>> strengths.  We should use (b) for expressing concepts that
can't be
>>>> semantically modeled as a call or invoke (this branch takes
both its
>>>> successors), and (d) for expressing things that can be (this
call may
>>>> never return), and annotation like things (this region (denoted
by
>>>> def-use of a token) is a reduction).
>>>>
>>>> I don't grok OpenMP, but perhaps we can come with one or
two
>>>> "generalized control flow"-type instructions that can
be used to
>>>> model the non-call/invoke like semantics we'd like LLVM to
know
>>>> about, and model the rest with (d)?
>>>>
>>>> -- Sanjoy
>>>>
>>>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev
>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>>>>>
>>>>>
>>>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at
dberlin.org> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini
<mehdi.amini at apple.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin
at dberlin.org>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via
llvm-dev
>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram
Sadanand via llvm-dev
>>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>>>
>>>>>>>> Hi Johannes,
>>>>>>>>
>>>>>>>>> I am especially curious where you get your
data from. Tapir [0]
>>>>>>>>> (and to some degree PIR [1]) have shown
that,
>>>>>>>>> counterintuitively, only a few changes to
LLVM passes are
>>>>>>>>> needed. Tapir was recently used in an MIT
class with a lot of
>>>>>>>>> students and it seemed to work well with
only minimal changes to
>>>>>>>>> analysis and especially transformation
passes.
>>>>>>>>
>>>>>>>> TAPIR is an elegant, small extension and, in
particular, I think
>>>>>>>> the idea of asymmetric parallel tasks and
control flow is a
>>>>>>>> clever way to express parallelism with serial
semantics, as in
>>>>>>>> Cilk.  Encoding the control flow extensions as
explicit
>>>>>>>> instructions is orthogonal to that, though
arguably more elegant
>>>>>>>> than using region tags + metadata.
>>>>>>>>
>>>>>>>> However, Cilk is a tiny language compared with
the full
>>>>>>>> complexity of other languages, like OpenMP.  To
take just one
>>>>>>>> example, TAPIR cannot express the ORDERED
construct of OpenMP.  A
>>>>>>>> more serious concern, IMO, is that TAPIR (like
Cilk) requires
>>>>>>>> serial semantics, whereas there are many
parallel languages, OpenMP
>>>>>>>> included, that do not obey that restriction.
>>>>>>>> Third, OpenMP has *numerous* clauses, e.g.,
REDUCTION or PRIVATE,
>>>>>>>> that are needed because without that, you’d be
dependent on
>>>>>>>> fundamentally hard compiler analyses to extract
the same
>>>>>>>> information for satisfactory parallel
performance; realistic
>>>>>>>> applications cannot depend on the success of
such analyses.
>>>>>>>
>>>>>>> I agree with this, but I’m also wondering if it
needs to be first
>>>>>>> class in the IR?
>>>>>>> For example we know our alias analysis is very
basic, and C/C++
>>>>>>> have a higher constraint thanks to their type
system, but we
>>>>>>> didn’t inject this higher level information that
helps the
>>>>>>> optimizer as first class IR constructs.
>>>>>>
>>>>>>
>>>>>> FWIW, while i agree with the general point, i
wouldn't use this
>>>>>> example.
>>>>>> Because we pretty much still suffer to this day because
of it (both
>>>>>> in AA, and devirt, and ...)  :) We can't always
even tell fields
>>>>>> apart
>>>>>>
>>>>>>
>>>>>> Is it inherent to the infrastructure, i.e. using
metadata instead
>>>>>> of first class IR construct or is it just a “quality of
>>>>>> implementation” issue?
>>>>>
>>>>> Not to derail this conversation:
>>>>>
>>>>> IMHO, At some point there is no real difference :)
>>>>>
>>>>> Because otherwise, everything is a QOI issue.
>>>>>
>>>>> IE if it's super tricky to get metadata that works well
and works
>>>>> right, doesn't get lost, etc, and that's inherent
to using metadata,
>>>>> that to me is not a QOI issue.
>>>>>
>>>>> So could it be done with metadata? Probably?
>>>>> But at the same time,  if it had been done with more first
class
>>>>> constructs, it would have happened years ago  and been much
lower cost.
>>>>>
>>>>>
>>>>> This is what I meant by “inherent to the infrastructure”,
thanks for
>>>>> clarifying.
>>>>>
>>>>>
>>>>> To clarify, we were proposing metadata that is used as
arguments to
>>>>> the region-annotation intrinsics. This metadata has the
nice
>>>>> property that it does not get dropped (so it is just being
used as a
>>>>> way of encoding whatever data structures are necessary
without
>>>>> predefining a syntactic schema).
>>>>>
>>>>> -Hal
>>>>>
>>>>>
>>>>> —
>>>>> Mehdi
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>>>
>>>>> --
>>>>> Hal Finkel
>>>>> Lead, Compiler Technology and Programming Languages
Leadership
>>>>> Computing Facility Argonne National Laboratory
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

Johannes Doerfert via llvm-dev

2017-Feb-02 10:44 UTC

head link

[llvm-dev] [RFC] IR-level Region Annotations

On 01/19, Sanjoy Das via llvm-dev wrote:> Now you could say that the llvm.experimental.intrinsic_a and
> llvm.experimental.intrinsic_b intrinsics are magic, and even such
> "obvious" optimizations are not allowed to happen across them;
but then
> calls to these intrinsics is pretty fundamentally different from
> "normal" calls, and are probably best modeled as new
instructions.
> You're going to have to do the same kind of auditing of passes either
> way, and the only extra cost of a new instruction is the extra bitcode
> reading / writing code.It does and I think ppl should therefor take a look at the alternative
proposal send out last week [0]. The "extra bitcode reading/writing
code" [1] is mechanical and should not be an issue.

Cheers,
  Johannes

[0] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109615.html
[1] https://reviews.llvm.org/D29250
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 228 bytes
Desc: Digital signature
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170202/8b2f62cb/attachment.sig>

llvm dev - Jan 2017 - [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations

[llvm-dev] [RFC] IR-level Region Annotations