Tian, Xinmin via llvm-dev
2017-Feb-01 06:59 UTC
[llvm-dev] [RFC] IR-level Region Annotations
From: mehdi.amini at apple.com
Sent: Tuesday, January 31, 2017 9:03 PM
To: Tian, Xinmin <xinmin.tian at intel.com>
Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; llvm-dev at lists.llvm.org; llvm-dev-request at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

On Jan 31, 2017, at 7:53 PM, Tian, Xinmin <xinmin.tian at intel.com> wrote:

> In this case, the inliner is educated to add all local variables to the tag of the enclosing parallel region, if there is an enclosing parallel region.

So isn't it a good example that shows that your intrinsic *cannot* be opaque and that IR passes need to be modified to handle not only the IR-region intrinsic but also the specific semantic of the tag?

[XT] I thought we said a number of times that there are small changes to be made. I quoted a ballpark of 2000 LOC vs. 6000 LOC of changes in one of the early emails.

It seems to me that this contradicts the claim that the "tag" specific semantic does not need to be handled by the optimizer and that the intrinsic can integrate seamlessly in LLVM, which invalidates the approach (of a generic intrinsic) entirely IMO.

Maybe you never intended to claim this, but this is a hidden cost in the original RFC, and I suspect this cost has to be carefully evaluated. At this point I'm not sure it is worth discussing anything further without seeing a proper LangRef update.

[XT] All we said is to minimize cost where possible. The intrinsic functions are a generic mechanism for representing a directive and region, such as prefetch, unroll, omp, .... Each instance of them will have its own semantics, which will be described in follow-up RFCs.

> In our icc implementation, it is even simpler: as we have a routine-level symbol table, the inliner adds a "private" attribute to those local variables without checking the enclosing scope; the parallelizer does check and use it.
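The inliner behaviour described above (record every inlined local on the enclosing parallel region's tag, if there is one) can be pictured with a small C model. This is only a sketch under stated assumptions: the QUAL.PRIVATE list is modelled as a plain array of names, and `region_tag`, `is_private` and `note_inlined_local` are hypothetical names, not LLVM or icc APIs.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PRIVATE 16

/* Hypothetical model of a region tag: the names listed under QUAL.PRIVATE. */
struct region_tag {
    const char *privates[MAX_PRIVATE];
    int count;
};

/* Returns 1 if `name` is already on the region's private list. */
int is_private(const struct region_tag *tag, const char *name) {
    for (int k = 0; k < tag->count; ++k)
        if (strcmp(tag->privates[k], name) == 0)
            return 1;
    return 0;
}

/* What a region-aware inliner would do for each callee local it
   materializes in the caller: record it on the enclosing region.
   Returns 0 when there is no enclosing region or the list is full. */
int note_inlined_local(struct region_tag *enclosing, const char *name) {
    if (enclosing == NULL)
        return 0;
    if (is_private(enclosing, name))
        return 1;
    if (enclosing->count == MAX_PRIVATE)
        return 0;
    enclosing->privates[enclosing->count++] = name;
    return 1;
}
```

The point of contention in the thread shows up directly in this model: `note_inlined_local` has to know what "private" means for this particular tag, so the inliner is no longer treating the annotation as opaque.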
Again, you're trying to address a specific case, while I'm just trying to identify a generic class of problem that your proposal fails to address explicitly.

[XT] It looks like there is a misunderstanding of the proposal. The proposal is to build up an experimental framework, with the proposed interface, to evaluate the set of changes (or cost) needed for the use cases we know of. Hal and I carefully positioned this RFC; all the intrinsics are even named as experimental. So far, the cost identified for handling these cases is within the range of our expectation.

Best,
-- Mehdi

From: mehdi.amini at apple.com
Sent: Tuesday, January 31, 2017 7:31 PM
To: Tian, Xinmin <xinmin.tian at intel.com>
Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; llvm-dev at lists.llvm.org; llvm-dev-request at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

On Jan 31, 2017, at 7:27 PM, Tian, Xinmin <xinmin.tian at intel.com> wrote:

> Remember that the prepare-phase is invoked in the FE or right after the FE, when inlining is not happening; that is why we don't call it a "pass". Chandler made a good point for this case a long time back.

What I was describing is the inlining in the optimizer pipeline.

> Hoisting alloca is totally ok. A new alloca is generated during outlining later on for anything marked as "private" (so long as the "private" information is saved in the tag). I thought we talked about this in an earlier email.

Can you describe how (and at which point) you get the private for "var" added to the tag?

-- Mehdi

> By the way, all the concerns you have are valid; we worked on resolving these issues 10+ years back when we did similar things in our compilers. I wouldn't claim we have perfect solutions, but we do have reasonably good solutions for handling general directives and OpenMP directives.
>
> Xinmin

-----Original Message-----
From: mehdi.amini at apple.com
Sent: Tuesday, January 31, 2017 7:08 PM
To: Tian, Xinmin <xinmin.tian at intel.com>
Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; llvm-dev at lists.llvm.org; llvm-dev-request at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

On Jan 31, 2017, at 6:48 PM, Tian, Xinmin <xinmin.tian at intel.com> wrote:

> Let me try this.
>
> You can simply consider that the prepare-phase (e.g. pre-privatization) was done in the FE (actually a library that can be used by multiple FEs at the LLVM IR level), that the region is run with 1 thread, and that the region annotation (scope, single-entry-single-exit), marked with region intrinsic functions, acts conservatively as a memory barrier for now (instead of checking individual memory dependences and aliasing via tags, which is what the actual implementation does). What optimization will mess up this region annotation?

The first thing that comes to my mind is inlining, which can put the IR in a form that breaks the invariant you tried to enforce with your "prepare-phase" (for example by hoisting an alloca).
-- Mehdi

-----Original Message-----
From: mehdi.amini at apple.com
Sent: Tuesday, January 31, 2017 5:47 PM
To: Tian, Xinmin <xinmin.tian at intel.com>
Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; llvm-dev at lists.llvm.org; llvm-dev-request at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

On Jan 31, 2017, at 5:38 PM, Tian, Xinmin <xinmin.tian at intel.com> wrote:

>> Ok, but this looks like a "workaround" for your specific use-case, I don't see how it can scale as a model-agnostic and general-purpose region semantic.
>
> I would say it is a design trade-off.

I'm not sure if we're talking about the same thing here: my understanding at this point is that the design trade-off you're making is "simplicity" by sacrificing "correctness".

Requiring the IR to stay in what you're calling a "canonical" form in your answer to Sanjoy in order to not miscompile a program is not an approach that seems compatible with how we deal with the IR usually.

> Regardless of whether it is a new instruction or an intrinsic with token/tag, it will consist of a model-agnostic part and a model-non-agnostic part. The package comes with a framework for parsing and using these intrinsics. See the reply I had for Sanjoy's email.

The answer to Sanjoy is not really helpful to clarify anything to me. At this point I still don't understand how this is supposed to be correct in general.

It would be helpful to have a LangRef patch that describes the semantic associated to your region intrinsics. Then we may be able to process some examples through the formalized description.
-- Mehdi

-----Original Message-----
From: mehdi.amini at apple.com
Sent: Saturday, January 21, 2017 1:57 PM
To: Tian, Xinmin <xinmin.tian at intel.com>
Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; llvm-dev at lists.llvm.org; llvm-dev-request at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

On Jan 20, 2017, at 11:17 AM, Tian, Xinmin <xinmin.tian at intel.com> wrote:

>> This means that the optimizer has to be aware of it, I'm missing the magic here?
>
> This is one option.
>
> The other option is what we do in our implementation, as I mentioned in our LLVM-HPC paper. We have a "prepare phase for pre-privatization" that can be invoked by both the Clang FE and the Fortran FE right after LLVM IR is generated. In this way, we are able to minimize the impact of optimizations on the original val and i.

Ok, but this looks like a "workaround" for your specific use-case, I don't see how it can scale as a model-agnostic and general-purpose region semantic.

The fact that you needed this pre-step in the first place seems to indicate to me that it confirms what multiple people expressed in this thread, for example what Daniel wrote here: http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html

-- Mehdi

> { void main() {
>     i32* val = alloca i32
>     i32* I = alloca i32
>     i32* priv_val = alloca i32
>     i32* priv_i = alloca i32
>     tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* %priv_val, i32 %priv_i), "QUAL.NUM_THREADS"(i32 4)]
>
>     %priv_i = omp_get_thread_num();
>     compute_something_into_val(%priv_val, %priv_i);
>     a[priv_i] = %priv_val;
>
>     llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>     ....
>
>     I
>     Val
>     Foo(val, i).
> }
>
> "Prepare phase" is our way of minimizing the impact on existing optimizations.
>
> Xinmin

-----Original Message-----
From: mehdi.amini at apple.com
Sent: Friday, January 20, 2017 10:54 AM
To: Tian, Xinmin
Cc: Sanjoy Das; Adve, Vikram Sadanand; llvm-dev at lists.llvm.org; llvm-dev-request at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> Sanjoy, the IR would be something like the below. It is ok to hoist the alloca instruction outside the region. There are some small changes in the optimizer to understand the region-annotation intrinsic.
>
> { void main() {
>     i32* val = alloca i32
>     tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]
>
>     int i = omp_get_thread_num();
>     compute_something_into_val(val, i);
>     a[i] = val;
>
>     llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
>
> With the above representation, we can do privatization and outlining as below:
>
> { void main() {
>     i32* val = alloca i32
>     i32* I = alloca i32
>     tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)]
>
>     %ii = omp_get_thread_num();
>     compute_something_into_val(%val, %i);
>     a[i] = %val;
>
>     llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }

Here we come to the interesting part: hoisting "i32* I = alloca i32" above the intrinsic requires updating the intrinsic's "QUAL.PRIVATE" information.
This means that the optimizer has to be aware of it, I'm missing the magic here?
I understand that an OpenMP-specific optimization can do it; the question is how an OpenMP-agnostic pass is supposed to behave in the face of llvm.experimental.intrinsic_a?

-- Mehdi

> 1. create i32* priv_val = alloca i32 and %priv_i = ... in the region, and replace all %val with %priv_val in the region.
> 2. perform outlining.
>
> Caller code
>     ....
>     omp_push_num_threads(4)
>     omp_fork_call( .... outline_par_region....)
>     ....
>
> Callee code:
>     Outlined_par_region {
>         i32* priv_val = alloca i32
>         i32* priv_i = ....
>
>         Ret
>     }
>
> For OpenMP, we do support it at -O0, -O1, -O2 and -O3. We had to make sure it runs correctly with and without optimizations and advanced analysis. So we need to preserve all source information for the BE.
> You can take a look at our LLVM-HPC paper for some details. There is still tons of work to be done. Thanks.
>
> Xinmin

-----Original Message-----
From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Sanjoy Das via llvm-dev
Sent: Thursday, January 19, 2017 10:13 PM
To: Adve, Vikram Sadanand <vadve at illinois.edu>
Cc: llvm-dev <llvm-dev at lists.llvm.org>; llvm-dev-request at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

Hi Vikram,

On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand <vadve at illinois.edu> wrote:
> Hi Sanjoy,
>
> Yes, that's exactly what we have been looking at recently here, but the region tags seem to make it possible to express the control flow as well, so I think we could start with regions+metadata, as Hal and

I'm not yet convinced that region tags are sufficient to model exotic control flow.
(I don't know OpenMP so this is a copy-pasted-edited example)

Say we have:

  void main() {
    #pragma omp parallel num_threads(4)
    {
      int i = omp_get_thread_num();
      int val;
      compute_something_into_val(&val, i);
      a[i] = val;
    }
  }

I presume the (eventual) intended lowering is something like this (if the intended lowering is different than this, and avoids the issue I'm trying to highlight then my point is moot):

  void main() {
    tok = llvm.experimental.intrinsic_a();

    int i = omp_get_thread_num();
    i32* val = alloca i32
    compute_something_into_val(val, i);
    a[i] = val;

    llvm.experimental.intrinsic_b(tok);
  }

However, LLVM is free to hoist the alloca to the entry block:

  void main() {
    i32* val = alloca i32
    tok = llvm.experimental.intrinsic_a();

    int i = omp_get_thread_num();
    compute_something_into_val(val, i);
    a[i] = val;

    llvm.experimental.intrinsic_b(tok);
  }

and now you have a race between the four parallel forks.

The problem here is that nothing in the IR expresses that we have four copies of the region running "at the same time". In fact, such a control flow is alien to LLVM today.

For instance, another evil optimization may turn:

  void main() {
    int a[4];
    #pragma omp parallel num_threads(4)
    {
      int i = omp_get_thread_num();
      int val = compute_something_into_val(i);
      a[i] = val;
    }

    return a[0] + a[1];
  }

to

  void main() {
    int a[4];
    #pragma omp parallel num_threads(4)
    {
      int i = omp_get_thread_num();
      int val = compute_something_into_val(i);
      a[i] = val;
    }

    return undef;
  }

since a[i] = val could have initialized at most one element in a.

Now you could say that the llvm.experimental.intrinsic_a and llvm.experimental.intrinsic_b intrinsics are magic, and even such "obvious" optimizations are not allowed to happen across them; but then calls to these intrinsics are pretty fundamentally different from "normal" calls, and are probably best modeled as new instructions.
You're going to have to do the same kind of auditing of passes either way, and the only extra cost of a new instruction is the extra bitcode reading / writing code.

I hope I made sense.

-- Sanjoy

> Xinmin proposed, and then figure out what needs to be first class instructions.
>
> --Vikram Adve

On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:

Hi,

My bias is to use both (b) and (d), since they have complementary strengths. We should use (b) for expressing concepts that can't be semantically modeled as a call or invoke (this branch takes both its successors), and (d) for expressing things that can be (this call may never return), and annotation-like things (this region (denoted by def-use of a token) is a reduction).

I don't grok OpenMP, but perhaps we can come up with one or two "generalized control flow"-type instructions that can be used to model the non-call/invoke-like semantics we'd like LLVM to know about, and model the rest with (d)?

-- Sanjoy

On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via llvm-dev <llvm-dev at lists.llvm.org> wrote:

Hi Johannes,

> I am especially curious where you get your data from. Tapir [0] (and to some degree PIR [1]) have shown that, counterintuitively, only a few changes to LLVM passes are needed. Tapir was recently used in an MIT class with a lot of students and it seemed to work well with only minimal changes to analysis and especially transformation passes.

TAPIR is an elegant, small extension and, in particular, I think the idea of asymmetric parallel tasks and control flow is a clever way to express parallelism with serial semantics, as in Cilk. Encoding the control flow extensions as explicit instructions is orthogonal to that, though arguably more elegant than using region tags + metadata.

However, Cilk is a tiny language compared with the full complexity of other languages, like OpenMP. To take just one example, TAPIR cannot express the ORDERED construct of OpenMP. A more serious concern, IMO, is that TAPIR (like Cilk) requires serial semantics, whereas there are many parallel languages, OpenMP included, that do not obey that restriction. Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are needed because without them you'd be dependent on fundamentally hard compiler analyses to extract the same information for satisfactory parallel performance; realistic applications cannot depend on the success of such analyses.

On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev <llvm-dev at lists.llvm.org> wrote:

I agree with this, but I'm also wondering if it needs to be first class in the IR?
For example we know our alias analysis is very basic, and C/C++ have a higher constraint thanks to their type system, but we didn't inject this higher level information that helps the optimizer as first class IR constructs.

On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at dberlin.org> wrote:

FWIW, while i agree with the general point, i wouldn't use this example.
Because we pretty much still suffer to this day because of it (both in AA, and devirt, and ...) :)
We can't always even tell fields apart

On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:

Is it inherent to the infrastructure, i.e. using metadata instead of first class IR construct or is it just a "quality of implementation" issue?

On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at dberlin.org> wrote:

Not to derail this conversation:

IMHO, at some point there is no real difference :)

Because otherwise, everything is a QOI issue.

IE if it's super tricky to get metadata that works well and works right, doesn't get lost, etc, and that's inherent to using metadata, that to me is not a QOI issue.

So could it be done with metadata?
Probably?
But at the same time, if it had been done with more first class constructs, it would have happened years ago and been much lower cost.

On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:

This is what I meant by "inherent to the infrastructure", thanks for clarifying.

-- Mehdi

On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:

To clarify, we were proposing metadata that is used as arguments to the region-annotation intrinsics. This metadata has the nice property that it does not get dropped (so it is just being used as a way of encoding whatever data structures are necessary without predefining a syntactic schema).

-Hal

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
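The two lowering steps Xinmin lists in this message (create private copies inside the region, then outline the region into a callee that the caller starts through a fork call) can be sketched in plain C. This is a sequential illustration only: `push_num_threads` and `fork_call` are hypothetical stand-ins for the `omp_push_num_threads` and `omp_fork_call` calls named in the message, not the real OpenMP runtime API, and `compute_something` is an invented region body.

```c
/* Hypothetical stand-ins for the runtime calls named in the thread.
   They run the region once per thread id, one after another. */
int requested_threads = 1;

void push_num_threads(int n) { requested_threads = n; }

typedef void (*outlined_fn)(int tid, int *shared_a);

void fork_call(outlined_fn fn, int *shared_a) {
    for (int tid = 0; tid < requested_threads; ++tid)
        fn(tid, shared_a);
}

/* Invented region body, used only for illustration. */
int compute_something(int i) { return i * i; }

/* Callee: the outlined parallel region. priv_i and priv_val live in
   this function's frame, so every invocation gets its own copy no
   matter where the original allocas sat in the caller. */
void outlined_par_region(int tid, int *shared_a) {
    int priv_i = tid;
    int priv_val = compute_something(priv_i);
    shared_a[priv_i] = priv_val;
}

/* Caller: what is left of main() once the region has been outlined. */
void caller(int a[4]) {
    push_num_threads(4);
    fork_call(outlined_par_region, a);
}
```

Calling `caller` on a four-element array fills it with 0, 1, 4, 9. The sketch also shows why the PRIVATE information has to survive until outlining: it is what tells the outliner which variables to re-create inside the callee.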
Mehdi Amini via llvm-dev
2017-Feb-01 07:29 UTC
[llvm-dev] [RFC] IR-level Region Annotations
> On Jan 31, 2017, at 10:59 PM, Tian, Xinmin <xinmin.tian at intel.com> wrote:
>
> From: mehdi.amini at apple.com
> Sent: Tuesday, January 31, 2017 9:03 PM
> To: Tian, Xinmin <xinmin.tian at intel.com>
> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; llvm-dev at lists.llvm.org; llvm-dev-request at lists.llvm.org
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>
> On Jan 31, 2017, at 7:53 PM, Tian, Xinmin <xinmin.tian at intel.com> wrote:
>
> In this case, the inliner is educated to add all local variables to the tag of the enclosing parallel region, if there is an enclosing parallel region.
>
> So isn't it a good example that shows that your intrinsic *cannot* be opaque and that IR passes need to be modified to handle not only the IR-region intrinsic but also the specific semantic of the tag?
>
> [XT] I thought we said a number of times that there are small changes to be made. I quoted a ballpark of 2000 LOC vs. 6000 LOC of changes in one of the early emails.

This didn't mean that the changes were meant specifically for OpenMP. My understanding was that this proposal is for a generic "IR-level Region Annotations" mechanism, and that's what the changes were for. Now it ends up being "let's support OpenMP semantic without adding openmp in the intrinsic names".

> It seems to me that this contradicts the claim that the "tag" specific semantic does not need to be handled by the optimizer and that the intrinsic can integrate seamlessly in LLVM, which invalidates the approach (of a generic intrinsic) entirely IMO.
>
> Maybe you never intended to claim this, but this is a hidden cost in the original RFC, and I suspect this cost has to be carefully evaluated. At this point I'm not sure it is worth discussing anything further without seeing a proper LangRef update.
>
> [XT] All we said is to minimize cost where possible. The intrinsic functions are a generic mechanism for representing a directive and region, such as prefetch, unroll, omp, .... Each instance of them will have its own semantics, which will be described in follow-up RFCs.

At this point I don't see any advantage in having a "generic intrinsic" that has an opaque tag since all the semantic is in the tag anyway. I'd have to see what is really "generic" in the handling of it...

Reid identified this very early in the thread (he is a lot more perspicacious than I am) here: http://lists.llvm.org/pipermail/llvm-dev/2017-January/108914.html

-- Mehdi

> In our icc implementation, it is even simpler: as we have a routine-level symbol table, the inliner adds a "private" attribute to those local variables without checking the enclosing scope; the parallelizer does check and use it.
>
> Again, you're trying to address a specific case, while I'm just trying to identify a generic class of problem that your proposal fails to address explicitly.
>
> [XT] It looks like there is a misunderstanding of the proposal. The proposal is to build up an experimental framework, with the proposed interface, to evaluate the set of changes (or cost) needed for the use cases we know of. Hal and I carefully positioned this RFC; all the intrinsics are even named as experimental. So far, the cost identified for handling these cases is within the range of our expectation.
> > Best, > > — > Mehdi > > > > > From: mehdi.amini at apple.com <mailto:mehdi.amini at apple.com> [mailto:mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>] > Sent: Tuesday, January 31, 2017 7:31 PM > To: Tian, Xinmin <xinmin.tian at intel.com <mailto:xinmin.tian at intel.com>> > Cc: Sanjoy Das <sanjoy at playingwithpointers.com <mailto:sanjoy at playingwithpointers.com>>; Adve, Vikram Sadanand <vadve at illinois.edu <mailto:vadve at illinois.edu>>; llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>; llvm-dev-request at lists.llvm.org <mailto:llvm-dev-request at lists.llvm.org> > Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations > > > > Sent from my iPhone > > On Jan 31, 2017, at 7:27 PM, Tian, Xinmin <xinmin.tian at intel.com <mailto:xinmin.tian at intel.com>> wrote: > > Remember that, the prepare-phase is invoked in the FE or right after FE, inlining is not happening, that is why we don't call it "pass". Chandler made a good point for this case a long time back. > > > What I was describing is the inlining in the optimizer pipeline. > > > > Hoisting alloca is totally ok. A new alloca is generated during outlining later on for anything marked as "private" (so long the "private" information is saves in the tag). I thought we talked this in an early email. > > > Can you describe how (and at which point) you get the private for "var" added to the tag? > > > -- > Mehdi > > > By the way, all concerns you have are all valid, we had worked on resolving these issues 10+ years back when we did similar things in our compilers. I wouldn't claim we have perfect solutions, but we do reasonable good solutions for handling general directives and openmp directives. 
> > Xinmin > > -----Original Message----- > From: mehdi.amini at apple.com <mailto:mehdi.amini at apple.com> [mailto:mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>] > Sent: Tuesday, January 31, 2017 7:08 PM > To: Tian, Xinmin <xinmin.tian at intel.com <mailto:xinmin.tian at intel.com>> > Cc: Sanjoy Das <sanjoy at playingwithpointers.com <mailto:sanjoy at playingwithpointers.com>>; Adve, Vikram Sadanand <vadve at illinois.edu <mailto:vadve at illinois.edu>>; llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>; llvm-dev-request at lists.llvm.org <mailto:llvm-dev-request at lists.llvm.org> > Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations > > > > > On Jan 31, 2017, at 6:48 PM, Tian, Xinmin <xinmin.tian at intel.com <mailto:xinmin.tian at intel.com>> wrote: > > Let me try this. > > You can simply consider the prepare-phase (e.g. pre-privatization) were done in FE (actually a library can be used by multiple FEs at LLVM IR level), the region is run with 1 thread, region annotation (scope, single-entry-single-exit) as memory barrier conservatively for now (instead of checking individual memory dependency, aliasing via tags which is the actual implementation is done) marked with region intrinsic functions. What optimization will mess up with this region-annotation? > > The first thing that comes to my mind is inlining that can put the IR in a form that breaks the invariant you tried to enforce with your "prepare-phase” (for example by hoisting an allocas). 
> > — > Mehdi > > > > > > > -----Original Message----- > From: mehdi.amini at apple.com <mailto:mehdi.amini at apple.com> [mailto:mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>] > Sent: Tuesday, January 31, 2017 5:47 PM > To: Tian, Xinmin <xinmin.tian at intel.com <mailto:xinmin.tian at intel.com>> > Cc: Sanjoy Das <sanjoy at playingwithpointers.com <mailto:sanjoy at playingwithpointers.com>>; Adve, Vikram Sadanand <vadve at illinois.edu <mailto:vadve at illinois.edu>>; llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>; llvm-dev-request at lists.llvm.org <mailto:llvm-dev-request at lists.llvm.org> > Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations > > > On Jan 31, 2017, at 5:38 PM, Tian, Xinmin <xinmin.tian at intel.com <mailto:xinmin.tian at intel.com>> wrote: > > Ok, but this looks like a “workaround" for your specific use-case, I don’t see how it can scale as a model-agnostic and general-purpose region semantic. > > I would say it is a design trade-off. > > I’m not sure if we’re talking about the same thing here: my understanding at this point is that the design trading-off you’re making “simplicity” by scarifying “correctness”. > > Requiring the IR to stay in what you’re calling a “canonical” form in your answer to Sanjoy in order to not miscompile a program is not an approach that seems compatible with how we deal with the IR usually. > > Regardless it is a new instruction or an intrinsics with token/tag, it will consist of model-agnostic part and model-non-agnostic part. The package comes with a framework for parsing and using these intrinsics. See the reply I had for Sanjoy's email. > > The answer to Sanjoy is not really helpful to clarify anything to me. At this point I still don’t understand how this is supposed to be correct in general. > > It would be helpful to have a LangRef patch that describes the semantic associated to your region intrinsics. 
Then we may be able to process some examples through the formalized description. > > — > Mehdi > > > > -----Original Message----- > From: mehdi.amini at apple.com <mailto:mehdi.amini at apple.com> [mailto:mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>] > Sent: Saturday, January 21, 2017 1:57 PM > To: Tian, Xinmin <xinmin.tian at intel.com <mailto:xinmin.tian at intel.com>> > Cc: Sanjoy Das <sanjoy at playingwithpointers.com <mailto:sanjoy at playingwithpointers.com>>; Adve, Vikram Sadanand <vadve at illinois.edu <mailto:vadve at illinois.edu>>; llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>; llvm-dev-request at lists.llvm.org <mailto:llvm-dev-request at lists.llvm.org> > Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations > > > On Jan 20, 2017, at 11:17 AM, Tian, Xinmin <xinmin.tian at intel.com <mailto:xinmin.tian at intel.com>> wrote: > > This means that the optimizer has to be aware of it, I’m missing the magic here? > > This is one option. > > The another option is that, as I mentioned in our LLVM-HPC paper in our implementation. We have a "prepare phase for pre-privatization" can be invoked by both Clang FE and Fortran FE right after LLVM IR is generated. So, in this way, we are able to minimize the optimizations impact for the original val and I > > Ok, but this looks like a “workaround" for your specific use-case, I don’t see how it can scale as a model-agnostic and general-purpose region semantic. 
> > The fact that you needed this pre-step in the first place seems to indicate to me that it confirms what multiple people expressed in this thread, for example what Daniel wrote here: http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html <http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html> > > — > Mehdi > > > > > { void main() { > i32* val = alloca i32 > i32* I = alloca 32 > i32* priv_val = alloca i32 > i32* priv_i alloca 32 > tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),"QUAL.PRIVATE"(i32* %priv_val, i32 %priv_i), "QUAL.NUM_THREADS"(i32 4)] > > %priv_i = omp_get_thread_num(); > compute_something_into_val(%priv_val, %priv_i); > a[priv_i] = %priv_val; > > llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()]; > .... > > I = > Val > Foo(val, i). > } > > "Prepare phase" is our way of minimizing the impact to existing optimizations. > > Xinmin > > -----Original Message----- > From: mehdi.amini at apple.com <mailto:mehdi.amini at apple.com> [mailto:mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>] > Sent: Friday, January 20, 2017 10:54 AM > To: Tian, Xinmin > Cc: Sanjoy Das; Adve, Vikram Sadanand; llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>; llvm-dev-request at lists.llvm.org <mailto:llvm-dev-request at lists.llvm.org> > Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations > > > On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > > Sanjoy, the IR would be like something below. It is ok to hoist alloca instruction outside the region. There are some small changes in optimizer to understand region-annotation intrinsic. 
>
>   void main() {
>     i32* val = alloca i32
>     tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>             "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]
>
>     int i = omp_get_thread_num();
>     compute_something_into_val(val, i);
>     a[i] = val;
>
>     llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>   }
>
> With the above representation, we can do privatization and outlining as
> below
>
>   void main() {
>     i32* val = alloca i32
>     i32* I = alloca i32
>     tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>             "QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)]
>
>     %i = omp_get_thread_num();
>     compute_something_into_val(%val, %i);
>     a[i] = %val;
>
>     llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>   }
>
> Here we come to the interesting part: the hoisting of "i32* I = alloca i32" above the intrinsics required updating the intrinsic's “QUAL.PRIVATE” information.
> This means that the optimizer has to be aware of it, I’m missing the magic here?
> I understand that an OpenMP-specific optimization can do it; the question is how an OpenMP-agnostic pass is supposed to behave in the face of llvm.experimental.intrinsic_a?
>
> --
> Mehdi
>
> 1. create i32* priv_val = alloca i32, %priv_i = ... in the region, and replace all %val with %priv_val in the region.
> 2. perform outlining.
>
> Caller code
>   ....
>   omp_push_num_threads(4)
>   omp_fork_call( .... outline_par_region....) ....
>
> Callee code:
>   Outlined_par_region {
>     I32* priv_val = alloca i32
>     I32* priv_i = ....
>
>     Ret
>   }
>
> For OpenMP, we do support it at -O0, -O1, -O2 and -O3. We had to make sure it runs correctly w/ and w/o optimizations and advanced analysis. So we need to preserve all source information for the BE.
> You can take a look at our LLVM-HPC paper for some details. There is still a ton of work to be done. Thanks.
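
The privatize-then-outline steps described above can be sketched in plain C. This is a sequential stand-in and not the OpenMP runtime: `fork_call`, `struct region_shared`, `run_parallel_region`, and the body of `compute_something_into_val` are hypothetical names and definitions chosen only for illustration, and `fork_call` simply invokes the outlined function once per logical thread instead of spawning threads.

```c
#include <assert.h>

enum { NUM_THREADS = 4 };

/* Hypothetical body; the thread leaves this function undefined. */
static void compute_something_into_val(int *val, int i) {
    assert(val != 0);
    *val = 10 * (i + 1);
}

/* Variables that stay shared are handed to the outlined region by address. */
struct region_shared {
    int *a;
};

/* Step 2: the region body, outlined into its own function.
 * Step 1 is visible here as well: priv_val and priv_i are locals of the
 * callee, so every invocation (every logical thread) owns a fresh copy. */
static void outlined_par_region(int thread_num, void *arg) {
    struct region_shared *shared = arg;
    int priv_i = thread_num; /* stands in for omp_get_thread_num() */
    int priv_val;
    compute_something_into_val(&priv_val, priv_i);
    shared->a[priv_i] = priv_val;
}

/* Hypothetical stand-in for the runtime's fork call: runs the outlined
 * function once per logical thread, one after another. */
static void fork_call(int num_threads, void (*fn)(int, void *), void *arg) {
    for (int t = 0; t < num_threads; ++t)
        fn(t, arg);
}

/* Caller after outlining: package the shared data, then fork. */
static void run_parallel_region(int a[NUM_THREADS]) {
    struct region_shared shared = { a };
    fork_call(NUM_THREADS, outlined_par_region, &shared);
}
```

Because the private copies live in the callee's frame, it no longer matters where the original `val` alloca sits in the caller, which seems to be the property the "QUAL.PRIVATE" tag is meant to carry until outlining happens.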
>
> Xinmin
>
> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Sanjoy Das via llvm-dev
> Sent: Thursday, January 19, 2017 10:13 PM
> To: Adve, Vikram Sadanand <vadve at illinois.edu>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>; llvm-dev-request at lists.llvm.org
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>
> Hi Vikram,
>
> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand <vadve at illinois.edu> wrote:
> Hi Sanjoy,
>
> Yes, that's exactly what we have been looking at recently here, but
> the region tags seem to make it possible to express the control flow
> as well, so I think we could start with regions+metadata, as Hal and
>
> I'm not yet convinced that region tags are sufficient to model exotic control flow.
>
> (I don't know OpenMP so this is a copy-pasted-edited example)
>
> Say we have:
>
>   void main() {
>   #pragma omp parallel num_threads(4)
>     {
>       int i = omp_get_thread_num();
>       int val;
>       compute_something_into_val(&val, i);
>       a[i] = val;
>     }
>   }
>
> I presume the (eventual) intended lowering is something like this (if the intended lowering is different than this, and avoids the issue I'm trying to highlight then my point is moot):
>
>   void main() {
>     tok = llvm.experimental.intrinsic_a();
>
>     int i = omp_get_thread_num();
>     i32* val = alloca i32
>     compute_something_into_val(val, i);
>     a[i] = val;
>
>     llvm.experimental.intrinsic_b(tok);
>   }
>
> However, LLVM is free to hoist the alloca to the entry block:
>
>   void main() {
>     i32* val = alloca i32
>     tok = llvm.experimental.intrinsic_a();
>
>     int i = omp_get_thread_num();
>     compute_something_into_val(val, i);
>     a[i] = val;
>
>     llvm.experimental.intrinsic_b(tok);
>   }
>
> and now you have a race between the four parallel forks.
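
To make the hazard in the hoisted form concrete, here is a minimal single-threaded C sketch. It is neither OpenMP nor LLVM IR: it replays by hand one legal interleaving of four logical threads, in which every thread finishes its store to `val` before any thread copies `val` into `a[i]`. The body of `compute_something_into_val`, the `slot` pointers, and the `replay_region`, `run_hoisted`, and `run_private` helpers are assumptions made purely for illustration.

```c
#include <assert.h>

enum { NUM_THREADS = 4 };

/* Hypothetical body; the thread never defines this function. */
static void compute_something_into_val(int *val, int i) {
    assert(val != 0);
    *val = 10 * (i + 1);
}

/* Replay one legal interleaving of the region: all stores to val happen
 * first, then all copies into a[i]. slot[i] is the storage that logical
 * thread i uses for val. */
static void replay_region(int *slot[NUM_THREADS], int a[NUM_THREADS]) {
    for (int i = 0; i < NUM_THREADS; ++i)
        compute_something_into_val(slot[i], i);
    for (int i = 0; i < NUM_THREADS; ++i)
        a[i] = *slot[i];
}

/* alloca hoisted above the region: every thread shares a single val. */
static void run_hoisted(int a[NUM_THREADS]) {
    int val = 0;
    int *slot[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; ++i)
        slot[i] = &val;
    replay_region(slot, a);
}

/* alloca kept inside the region: each thread has its own val. */
static void run_private(int a[NUM_THREADS]) {
    int val[NUM_THREADS] = {0};
    int *slot[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; ++i)
        slot[i] = &val[i];
    replay_region(slot, a);
}
```

Under this particular interleaving, `run_hoisted` leaves the last thread's value (40) in every element of `a`, while `run_private` produces 10, 20, 30, 40. A real parallel run of the hoisted form could produce other mixtures, which is exactly what makes it a race.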
>
> The problem here is that nothing in the IR expresses that we have four copies of the region running "at the same time". In fact, such a control flow is alien to LLVM today.
>
> For instance, another evil optimization may turn:
>
>   void main() {
>     int a[4];
>   #pragma omp parallel num_threads(4)
>     {
>       int i = omp_get_thread_num();
>       int val = compute_something_into_val(i);
>       a[i] = val;
>     }
>
>     return a[0] + a[1];
>   }
>
> to
>
>   void main() {
>     int a[4];
>   #pragma omp parallel num_threads(4)
>     {
>       int i = omp_get_thread_num();
>       int val = compute_something_into_val(i);
>       a[i] = val;
>     }
>
>     return undef;
>   }
>
> since a[i] = val could have initialized at most one element in a.
>
> Now you could say that the llvm.experimental.intrinsic_a and llvm.experimental.intrinsic_b intrinsics are magic, and even such "obvious" optimizations are not allowed to happen across them; but then calls to these intrinsics are pretty fundamentally different from "normal" calls, and are probably best modeled as new instructions.
> You're going to have to do the same kind of auditing of passes either way, and the only extra cost of a new instruction is the extra bitcode reading / writing code.
>
> I hope I made sense.
>
> -- Sanjoy
>
> Xinmin proposed, and then figure out what needs to be first class
> instructions.
>
> --Vikram Adve
>
> On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
>
> Hi,
>
> My bias is to use both (b) and (d), since they have complementary
> strengths. We should use (b) for expressing concepts that can't be
> semantically modeled as a call or invoke (this branch takes both its
> successors), and (d) for expressing things that can be (this call may
> never return), and annotation like things (this region (denoted by
> def-use of a token) is a reduction).
>
> I don't grok OpenMP, but perhaps we can come up with one or two
> "generalized control flow"-type instructions that can be used to
> model the non-call/invoke like semantics we'd like LLVM to know
> about, and model the rest with (d)?
>
> -- Sanjoy
>
> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>
> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>
> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>
> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hi Johannes,
>
> I am especially curious where you get your data from. Tapir [0]
> (and to some degree PIR [1]) have shown that,
> counterintuitively, only a few changes to LLVM passes are
> needed. Tapir was recently used in an MIT class with a lot of
> students and it seemed to work well with only minimal changes to
> analysis and especially transformation passes.
>
> TAPIR is an elegant, small extension and, in particular, I think
> the idea of asymmetric parallel tasks and control flow is a
> clever way to express parallelism with serial semantics, as in
> Cilk. Encoding the control flow extensions as explicit
> instructions is orthogonal to that, though arguably more elegant than using region tags + metadata.
>
> However, Cilk is a tiny language compared with the full
> complexity of other languages, like OpenMP. To take just one
> example, TAPIR cannot express the ORDERED construct of OpenMP.
> A more serious concern, IMO, is that TAPIR (like Cilk) requires
> serial semantics, whereas there are many parallel languages, OpenMP included, that do not obey that restriction.
> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE,
> that are needed because without that, you’d be dependent on
> fundamentally hard compiler analyses to extract the same
> information for satisfactory parallel performance; realistic
> applications cannot depend on the success of such analyses.
>
> I agree with this, but I’m also wondering if it needs to be first
> class in the IR?
> For example we know our alias analysis is very basic, and C/C++
> have a higher constraint thanks to their type system, but we
> didn’t inject this higher level information that helps the
> optimizer as first class IR constructs.
>
> FWIW, while i agree with the general point, i wouldn't use this example.
> Because we pretty much still suffer to this day because of it (both
> in AA, and devirt, and ...) :) We can't always even tell fields
> apart
>
> Is it inherent to the infrastructure, i.e. using metadata instead
> of first class IR construct or is it just a “quality of implementation” issue?
>
> Not to derail this conversation:
>
> IMHO, At some point there is no real difference :)
>
> Because otherwise, everything is a QOI issue.
>
> IE if it's super tricky to get metadata that works well and works
> right, doesn't get lost, etc, and that's inherent to using metadata,
> that to me is not a QOI issue.
>
> So could it be done with metadata? Probably?
> But at the same time, if it had been done with more first class
> constructs, it would have happened years ago and been much lower cost.
>
> This is what I meant by “inherent to the infrastructure”, thanks for
> clarifying.
>
> To clarify, we were proposing metadata that is used as arguments to
> the region-annotation intrinsics.
> This metadata has the nice
> property that it does not get dropped (so it is just being used as a
> way of encoding whatever data structures are necessary without predefining a syntactic schema).
>
> -Hal
>
> --
> Mehdi
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Hal Finkel via llvm-dev
2017-Feb-01 17:34 UTC
[llvm-dev] [RFC] IR-level Region Annotations
On 02/01/2017 01:29 AM, Mehdi Amini via llvm-dev wrote:
>
>> On Jan 31, 2017, at 10:59 PM, Tian, Xinmin <xinmin.tian at intel.com> wrote:
>>
>> From: mehdi.amini at apple.com [mailto:mehdi.amini at apple.com]
>> Sent: Tuesday, January 31, 2017 9:03 PM
>> To: Tian, Xinmin <xinmin.tian at intel.com>
>> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; llvm-dev at lists.llvm.org; llvm-dev-request at lists.llvm.org
>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>
>> On Jan 31, 2017, at 7:53 PM, Tian, Xinmin <xinmin.tian at intel.com> wrote:
>> In this case, inliner is educated to add all local variables to
>> the tag of enclosing parallel region, if there is enclosing
>> parallel region.
>>
>> So isn’t it a good example that shows that your intrinsic *cannot* be
>> opaque and that IR passes need to be modified to handle not only the
>> IR-region intrinsic but also the specific semantic of the tag?
>> [XT] I thought we said a number of times, there are small changes to
>> be made. I quoted a ball park # 2000 LOC vs. 6000 LOC w.r.t changes,
>> in one of early emails.
>
> This didn’t mean that the changes were meant specifically for OpenMP.
> My understanding was that this proposal is for a generic “IR-level
> Region Annotations” mechanism, and that’s what the changes were for.
> Now it ends up being “let’s support OpenMP semantic without adding
> openmp in the intrinsic names”.

The point here is to abstract the properties about which other passes might need to know by using a set of generic intrinsics. The fact that you can't hoist allocas past one of these intrinsics is nowhere close to saying that the individual optimization passes need to know anything about OpenMP, parallelism, etc. Regardless of how many LOC are in Intel's prototype, we're obviously aiming for minimal impact on the current upstream infrastructure.

>> It seems to me that this contradicts the claim that the “tag”
>> specific semantic does not need to be handled by the optimizer and
>> that the intrinsic can integrate seamlessly in LLVM, which
>> invalidates the approach (of a generic intrinsic) entirely IMO.
>> Maybe you never intended to claim this, but this is a hidden cost in
>> the original RFC, and I suspect this cost has to be carefully
>> evaluated. At this point I’m not sure it is worth discussing anything
>> further without seeing a proper LangRef update.
>> [XT] All we said is to minimize cost when it is possible. The
>> intrinsic functions are generic, for representing a directive and
>> region, such as prefetch, unroll, omp, …. Each instance of them will
>> have their semantics which will be in follow-up RFCs
>
> At this point I don’t see any advantage in having a “generic
> intrinsic" that has an opaque tag since all the semantic is in the tag
> anyway. I’d have to see what is really “generic” in the handling of it...

This is completely opposite to the point. The semantics relevant to the rest of the optimization pipeline should be in the intrinsics themselves. I've yet to see anything to suggest that we can't do that.

> Reid identified this very early in the thread (he is a lot more
> perspicacious than I am) here:
> http://lists.llvm.org/pipermail/llvm-dev/2017-January/108914.html

There are multiple levels here:

  a) Semantics relevant to the rest of the pipeline
  b) Semantics relevant to parallelism-specific optimizations (e.g. redundant barrier removal)
  c) Semantics relevant to a specific programming model / extension (OpenMP, OpenACC, C++ parallel algorithms, whatever)

We'd like to separate these three levels, and I believe the proposed scheme allows us to do that. Obviously, this assumes that we can indeed have a small set of intrinsics that satisfy the needs of (a). Furthermore, if we're going to use intrinsics, we need to decide whether all of the relevant semantics are reasonable to encode in intrinsics (e.g. is it reasonable to have an intrinsic past which you can't hoist an alloca, or would that need to be an instruction, etc.)

  -Hal

> --
> Mehdi
>
>> In our icc implementation, it is even simpler, as we have a routine
>> level symbol table, the inliner adds the “private” attribute to
>> those local variables w/o checking enclosing scope, the
>> parallelizer does check and use it.
>>
>> Again, you’re trying to address a specific case, while I’m just
>> trying to identify a generic class of problem that your proposal
>> fails to address explicitly.
>> [XT] It looks like there is a misunderstanding of the proposal. The proposal
>> is to build up an experimental framework with the proposed interface to
>> evaluate the set of changes (or cost) needed for the use cases we know of.
>> Hal and I carefully positioned this RFC. Even all intrinsics are
>> named as experimental. So far, the cost for handling the cases
>> identified is within the range of our expectation.
>> Best,
>> --
>> Mehdi
>>
>> From: mehdi.amini at apple.com [mailto:mehdi.amini at apple.com]
>> Sent: Tuesday, January 31, 2017 7:31 PM
>> To: Tian, Xinmin <xinmin.tian at intel.com>
>> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; llvm-dev at lists.llvm.org; llvm-dev-request at lists.llvm.org
>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>
>> Sent from my iPhone
>>
>> On Jan 31, 2017, at 7:27 PM, Tian, Xinmin <xinmin.tian at intel.com> wrote:
>>
>> Remember that, the prepare-phase is invoked in the FE or
>> right after FE, inlining is not happening, that is why we
>> don't call it "pass". Chandler made a good point for this
>> case a long time back.
>>
>> What I was describing is the inlining in the optimizer pipeline.
>>
>> Hoisting alloca is totally ok. A new alloca is generated
>> during outlining later on for anything marked as "private"
>> (so long as the "private" information is saved in the tag). I
>> thought we talked about this in an early email.
>>
>> Can you describe how (and at which point) you get the private for
>> "var" added to the tag?
>> --
>> Mehdi
>>
>> By the way, all concerns you have are all valid, we had
>> worked on resolving these issues 10+ years back when we did
>> similar things in our compilers. I wouldn't claim we have
>> perfect solutions, but we do have reasonably good solutions for
>> handling general directives and openmp directives.
>>
>> Xinmin
>>
>> -----Original Message-----
>> From: mehdi.amini at apple.com [mailto:mehdi.amini at apple.com]
>> Sent: Tuesday, January 31, 2017 7:08 PM
>> To: Tian, Xinmin <xinmin.tian at intel.com>
>> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; llvm-dev at lists.llvm.org; llvm-dev-request at lists.llvm.org
>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>
>> On Jan 31, 2017, at 6:48 PM, Tian, Xinmin <xinmin.tian at intel.com> wrote:
>>
>> Let me try this.
>>
>> You can simply consider the prepare-phase (e.g.
>> pre-privatization) were done in FE (actually a library
>> can be used by multiple FEs at LLVM IR level), the
>> region is run with 1 thread, region annotation (scope,
>> single-entry-single-exit) as memory barrier
>> conservatively for now (instead of checking individual
>> memory dependency, aliasing via tags, which is what the actual
>> implementation does) marked with region intrinsic
>> functions. What optimization will mess up with this
>> region-annotation?
>>
>> The first thing that comes to my mind is inlining that can
>> put the IR in a form that breaks the invariant you tried to
>> enforce with your “prepare-phase” (for example by hoisting an
>> alloca).
>>
>> --
>> Mehdi
>>
>> -----Original Message-----
>> From: mehdi.amini at apple.com [mailto:mehdi.amini at apple.com]
>> Sent: Tuesday, January 31, 2017 5:47 PM
>> To: Tian, Xinmin <xinmin.tian at intel.com>
>> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; llvm-dev at lists.llvm.org; llvm-dev-request at lists.llvm.org
>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>
>> On Jan 31, 2017, at 5:38 PM, Tian, Xinmin <xinmin.tian at intel.com> wrote:
>>
>> Ok, but this looks like a “workaround” for your specific
>> use-case, I don’t see how it can scale as a model-agnostic and
>> general-purpose region semantic.
>>
>> I would say it is a design trade-off.
>>
>> I’m not sure if we’re talking about the same thing here:
>> my understanding at this point is that the design
>> trade-off you’re making is “simplicity” by sacrificing
>> “correctness”.
>>
>> Requiring the IR to stay in what you’re calling a
>> “canonical” form in your answer to Sanjoy in order to not
>> miscompile a program is not an approach that seems
>> compatible with how we deal with the IR usually.
>>
>> Regardless of whether it is a new instruction or an intrinsic
>> with token/tag, it will consist of a model-agnostic
>> part and a model-non-agnostic part. The package comes
>> with a framework for parsing and using these
>> intrinsics. See the reply I had for Sanjoy's email.
>>
>> The answer to Sanjoy is not really helpful to clarify
>> anything to me. At this point I still don’t understand
>> how this is supposed to be correct in general.
>> >> It would be helpful to have a LangRef patch that >> describes the semantic associated to your region >> intrinsics. Then we may be able to process some examples >> through the formalized description. >> >> — >> >> Mehdi >> >> -----Original Message----- >> >> From:mehdi.amini at apple.com >> <mailto:mehdi.amini at apple.com>[mailto:mehdi.amini at apple.com] >> >> Sent: Saturday, January 21, 2017 1:57 PM >> >> To: Tian, Xinmin <xinmin.tian at intel.com >> <mailto:xinmin.tian at intel.com>> >> >> Cc: Sanjoy Das <sanjoy at playingwithpointers.com >> <mailto:sanjoy at playingwithpointers.com>>; Adve, >> Vikram Sadanand <vadve at illinois.edu >> <mailto:vadve at illinois.edu>>;llvm-dev at lists.llvm.org >> <mailto:llvm-dev at lists.llvm.org>;llvm-dev-request at lists.llvm.org >> <mailto:llvm-dev-request at lists.llvm.org> >> >> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations >> >> On Jan 20, 2017, at 11:17 AM, Tian, Xinmin >> <xinmin.tian at intel.com >> <mailto:xinmin.tian at intel.com>> wrote: >> >> This means that the optimizer has >> to be aware of it, I’m missing >> the magic here? >> >> This is one option. >> >> The another option is that, as I mentioned in our >> LLVM-HPC paper in our implementation. We have a >> "prepare phase for pre-privatization" can be >> invoked by both Clang FE and Fortran FE right >> after LLVM IR is generated. So, in this way, we >> are able to minimize the optimizations impact for >> the original val and I >> >> Ok, but this looks like a “workaround" for your >> specific use-case, I don’t see how it can scale as a >> model-agnostic and general-purpose region semantic. 
>> >> The fact that you needed this pre-step in the first >> place seems to indicate to me that it confirms what >> multiple people expressed in this thread, for example >> what Daniel wrote >> here:http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html >> >> — >> >> Mehdi >> >> { void main() { >> >> i32* val = alloca i32 >> >> i32* I = alloca 32 >> >> i32* priv_val = alloca i32 >> >> i32* priv_i alloca 32 >> >> tok = llvm.experimental.intrinsic_a()[ >> "DIR.PARALLEL"(),"QUAL.PRIVATE"(i32* %priv_val, >> i32 %priv_i), "QUAL.NUM_THREADS"(i32 4)] >> >> %priv_i = omp_get_thread_num(); >> >> compute_something_into_val(%priv_val, %priv_i); >> >> a[priv_i] = %priv_val; >> >> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()]; >> >> .... >> >> I >> >> Val >> >> Foo(val, i). >> >> } >> >> "Prepare phase" is our way of minimizing the >> impact to existing optimizations. >> >> Xinmin >> >> -----Original Message----- >> >> From:mehdi.amini at apple.com >> <mailto:mehdi.amini at apple.com>[mailto:mehdi.amini at apple.com] >> >> Sent: Friday, January 20, 2017 10:54 AM >> >> To: Tian, Xinmin >> >> Cc: Sanjoy Das; Adve, Vikram >> Sadanand;llvm-dev at lists.llvm.org >> <mailto:llvm-dev at lists.llvm.org>;llvm-dev-request at lists.llvm.org >> <mailto:llvm-dev-request at lists.llvm.org> >> >> Subject: Re: [llvm-dev] [RFC] IR-level Region >> Annotations >> >> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin >> via llvm-dev <llvm-dev at lists.llvm.org >> <mailto:llvm-dev at lists.llvm.org>> wrote: >> >> Sanjoy, the IR would be like something below. >> It is ok to hoist alloca instruction outside >> the region. There are some small changes in >> optimizer to understand region-annotation >> intrinsic. 
>> >> { void main() { >> >> i32* val = alloca i32 >> >> tok = llvm.experimental.intrinsic_a()[ >> "DIR.PARALLEL"(), >> >> "QUAL.PRIVATE"(i32* val), >> "QUAL.NUM_THREADS"(i32 4)] >> >> int i = omp_get_thread_num(); >> >> compute_something_into_val(val, i); >> >> a[i] = val; >> >> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()]; >> >> } >> >> With above representation, we can do >> privatization and outlining as >> >> below >> >> { void main() { >> >> i32* val = alloca i32 >> >> i32* I = alloca 32 >> >> tok = llvm.experimental.intrinsic_a()[ >> "DIR.PARALLEL"(), >> >> "QUAL.PRIVATE"(i32* %val, i32 %i), >> "QUAL.NUM_THREADS"(i32 4)] >> >> %ii = omp_get_thread_num(); >> >> compute_something_into_val(%val, %i); a[i] >> %val; >> >> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()]; >> >> } >> >> Here we come to the interesting part: the >> hoisting of "i32* I = alloca 32” above the >> intrinsics required to update the intrinsics >> information “QUAL.PRIVATE”. >> >> This means that the optimizer has to be aware of >> it, I’m missing the magic here? >> >> I understand that an openmp specific optimization >> can do it, the question is how it an openmp >> agnostic supposed to behave in face of >> llvm.experimental.intrinsic_a? >> >> — >> >> Mehdi >> >> 1. create i32* priv_val = alloca i32 %priv_i >> = ...in the region, and replace all %val >> with %prv_val in the region. >> >> 2. perform outlining. >> >> Caller code >> >> .... >> >> omp_push_num_threads(4) >> >> omp_fork_call( .... outline_par_region....) .... >> >> Callee code: >> >> Outlined_par_rgion { >> >> I32* priv_val = alloca 32 >> >> I32* priv_i = .... >> >> Ret >> >> } >> >> For OpenMP, we do support it at -O0, -O1, -O2 >> and -O3. We had to make sure it runs >> correctly w/ and w/o optimizations and >> advanced analysis. So we need to preserve all >> source information for BE. >> >> You can take a look our LLVM-HPC paper for a >> bit some details. There are still tons of >> work to be done. Thanks. 
>> >> Xinmin >> >> -----Original Message----- >> >> From: llvm-dev >> [mailto:llvm-dev-bounces at lists.llvm.org] On >> Behalf Of Sanjoy Das via llvm-dev >> >> Sent: Thursday, January 19, 2017 10:13 PM >> >> To: Adve, Vikram Sadanand <vadve at illinois.edu >> <mailto:vadve at illinois.edu>> >> >> Cc: llvm-dev <llvm-dev at lists.llvm.org >> <mailto:llvm-dev at lists.llvm.org>>;llvm-dev-request at lists.llvm.org >> <mailto:llvm-dev-request at lists.llvm.org> >> >> Subject: Re: [llvm-dev] [RFC] IR-level Region >> Annotations >> >> Hi Vikram, >> >> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram >> Sadanand <vadve at illinois.edu >> <mailto:vadve at illinois.edu>> wrote: >> >> Hi Sanjoy, >> >> Yes, that's exactly what we have been >> looking at recently here, but >> >> the region tags seem to make it possible >> to express the control flow >> >> as well, so I think we could start with >> reg ions+metadata, as Hal and >> >> I'm not yet convinced that region tags are >> sufficient to model exotic control flow. 
>> >> (I don't know OpenMP so this is a >> copy-pasted-edited example) >> >> Say we have: >> >> void main() { >> >> #pragma omp parallel num_threads(4) >> >> { >> >> int i = omp_get_thread_num(); >> >> int val; >> >> compute_something_into_val(&val, i); >> >> a[i] = val; >> >> } >> >> } >> >> I presume the (eventual) intended lowering is >> something like this (if the intended lowering >> is different than this, and avoids the issue >> I'm trying to highlight then my point is moot): >> >> void main() { >> >> tok = llvm.experimental.intrinsic_a(); >> >> int i = omp_get_thread_num(); >> >> i32* val = alloca i32 >> >> compute_something_into_val(val, i); >> >> a[i] = val; >> >> llvm.experimental.intrinsic_b(tok); >> >> } >> >> However, LLVM is free to hoist the alloca to >> the entry block: >> >> void main() { >> >> i32* val = alloca i32 >> >> tok = llvm.experimental.intrinsic_a(); >> >> int i = omp_get_thread_num(); >> >> compute_something_into_val(val, i); >> >> a[i] = val; >> >> llvm.experimental.intrinsic_b(tok); >> >> } >> >> and now you have a race between the four >> parallel forks. >> >> The problem here is that nothing in the IR >> expresses that we have four copies of the >> region running "at the same time". In fact, >> such a control flow is alien to LLVM today. >> >> For instance, another evil optimization may turn: >> >> void main() { >> >> int a[4]; >> >> #pragma omp parallel num_threads(4) >> >> { >> >> int i = omp_get_thread_num(); >> >> int val = compute_something_into_val(i); >> >> a[i] = val; >> >> } >> >> return a[0] + a[1]; >> >> } >> >> to >> >> void main() { >> >> int a[4]; >> >> #pragma omp parallel num_threads(4) >> >> { >> >> int i = omp_get_thread_num(); >> >> int val = compute_something_into_val(i); >> >> a[i] = val; >> >> } >> >> return undef; >> >> } >> >> since a[i] = val could have initialized at >> most one element in a. 
>> >> Now you could say that the >> llvm.experimental.intrinsic_a and >> llvm.experimental.intrinsic_b intrinsics are >> magic, and even such "obvious" optimizations >> are not allowed to happen across them; but >> then calls to these intrinsics is pretty >> fundamentally different from "normal" calls, >> and are probably best modeled as new >> instructions. >> >> You're going to have to do the same kind of >> auditing of passes either way, and the only >> extra cost of a new instruction is the extra >> bitcode reading / writing code. >> >> I hope I made sense. >> >> -- Sanjoy >> >> Xinmin proposed, and then figure out what >> needs to be first class >> >> instructions. >> >> --Vikram Adve >> >> On Jan 19, 2017, at 11:03 PM, Sanjoy >> Das <sanjoy at playingwithpointers.com >> <mailto:sanjoy at playingwithpointers.com>> >> wrote: >> >> Hi, >> >> My bias is to use both (b) and (d), >> since they have complementary >> >> strengths. We should use (b) for >> expressing concepts that can't be >> >> semantically modeled as a call or >> invoke (this branch takes both its >> >> successors), and (d) for expressing >> things that can be (this call may >> >> never return), and annotation like >> things (this region (denoted by >> >> def-use of a token) is a reduction). >> >> I don't grok OpenMP, but perhaps we >> can come with one or two >> >> "generalized control flow"-type >> instructions that can be used to >> >> model the non-call/invoke like >> semantics we'd like LLVM to know >> >> about, and model the rest with (d)? 
>> >> -- Sanjoy >> >> On Thu, Jan 19, 2017 at 8:28 PM, Hal >> Finkel via llvm-dev >> >> <llvm-dev at lists.llvm.org >> <mailto:llvm-dev at lists.llvm.org>> wrote: >> >> On 01/19/2017 03:36 PM, Mehdi >> Amini via llvm-dev wrote: >> >> On Jan 19, 2017, at 1:32 PM, >> Daniel Berlin >> <dberlin at dberlin.org >> <mailto:dberlin at dberlin.org>> wrote: >> >> On Thu, Jan 19, 2017 at 1:12 >> PM, Mehdi Amini >> <mehdi.amini at apple.com >> <mailto:mehdi.amini at apple.com>> >> wrote: >> >> On Jan 19, 2017, at 12:04 PM, >> Daniel Berlin >> <dberlin at dberlin.org >> <mailto:dberlin at dberlin.org>> >> wrote: >> >> On Thu, Jan 19, 2017 at 11:46 >> AM, Mehdi Amini via llvm-dev >> >> <llvm-dev at lists.llvm.org >> <mailto:llvm-dev at lists.llvm.org>> >> wrote: >> >> On Jan 19, 2017, at >> 11:36 AM, Adve, >> Vikram Sadanand via >> llvm-dev >> >> <llvm-dev at lists.llvm.org >> <mailto:llvm-dev at lists.llvm.org>> >> wrote: >> >> Hi Johannes, >> >> I am especially >> curious where you >> get your data >> from. Tapir [0] >> >> (and to some >> degree PIR [1]) >> have shown that, >> >> counterintuitively, >> only a few >> changes to LLVM >> passes are >> >> needed. Tapir was >> recently used in >> an MIT class with >> a lot of >> >> students and it >> seemed to work >> well with only >> minimal changes to >> >> analysis and >> especially >> transformation >> passes. >> >> TAPIR is an elegant, >> small extension and, >> in particular, I think >> >> the idea of >> asymmetric parallel >> tasks and control >> flow is a >> >> clever way to express >> parallelism with >> serial semantics, as in >> >> Cilk. Encoding the >> control flow >> extensions as explicit >> >> instructions is >> orthogonal to that, >> though arguably more >> elegant than using >> region tags + metadata. >> >> However, Cilk is a >> tiny language >> compared with the full >> >> complexity of other >> languages, like >> OpenMP. To take just one >> >> example, TAPIR cannot >> express the ORDERED >> construct of OpenMP. 
>>>>>>>>> A more serious concern, IMO, is that TAPIR (like Cilk) requires
>>>>>>>>> serial semantics, whereas there are many parallel languages,
>>>>>>>>> OpenMP included, that do not obey that restriction.
>>>>>>>>>
>>>>>>>>> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or
>>>>>>>>> PRIVATE, that are needed because without that, you’d be
>>>>>>>>> dependent on fundamentally hard compiler analyses to extract
>>>>>>>>> the same information for satisfactory parallel performance;
>>>>>>>>> realistic applications cannot depend on the success of such
>>>>>>>>> analyses.
>>>>>>>>
>>>>>>>> I agree with this, but I’m also wondering if it needs to be
>>>>>>>> first class in the IR?
>>>>>>>>
>>>>>>>> For example we know our alias analysis is very basic, and C/C++
>>>>>>>> have a higher constraint thanks to their type system, but we
>>>>>>>> didn’t inject this higher level information that helps the
>>>>>>>> optimizer as first class IR constructs.
>>>>>>>
>>>>>>> FWIW, while i agree with the general point, i wouldn't use this
>>>>>>> example.
>>>>>>>
>>>>>>> Because we pretty much still suffer to this day because of it
>>>>>>> (both in AA, and devirt, and ...) :) We can't always even tell
>>>>>>> fields apart
>>>>>>
>>>>>> Is it inherent to the infrastructure, i.e. using metadata instead
>>>>>> of first class IR construct or is it just a “quality of
>>>>>> implementation” issue?
>>>>>
>>>>> Not to derail this conversation:
>>>>>
>>>>> IMHO, At some point there is no real difference :)
>>>>>
>>>>> Because otherwise, everything is a QOI issue.
>>>>>
>>>>> IE if it's super tricky to get metadata that works well and works
>>>>> right, doesn't get lost, etc, and that's inherent to using
>>>>> metadata, that to me is not a QOI issue.
>>>>>
>>>>> So could it be done with metadata? Probably?
>>>>>
>>>>> But at the same time, if it had been done with more first class
>>>>> constructs, it would have happened years ago and been much lower
>>>>> cost.
>>>> This is what I meant by “inherent to the infrastructure”, thanks
>>>> for clarifying.
>>>
>>> To clarify, we were proposing metadata that is used as arguments to
>>> the region-annotation intrinsics. This metadata has the nice
>>> property that it does not get dropped (so it is just being used as a
>>> way of encoding whatever data structures are necessary without
>>> predefining a syntactic schema).
>>>
>>> -Hal
>>>
>>>> —
>>>> Mehdi
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory

--
Hal Finkel