Tian, Xinmin via llvm-dev
2017-Jan-13 17:00 UTC
[llvm-dev] [RFC] IR-level Region Annotations
Mehdi, thanks for good questions.

>>>>>Something isn’t clear to me about how do you preserve the validity of the region annotations since regular passes don’t know about the attached semantic?

There are some small changes we have to make in some optimizations to make sure they do not invalidate the attached annotation semantics: 1) provide hand-shaking / query utilities so that an optimization knows the region is a parallel loop, and 2) set up a proper optimization phase ordering. In our product compiler ICC, we used both approaches.

>>>>>For example, if a region is marking a loop as parallel from an OpenMP pragma, but a strength reduction transformation introduces a loop-carried dependency and thus invalidate the “parallel” semantic?

Yes, there is a list of such cases, e.g. forward substitution, strength reduction, global constant propagation. Here is another example: under serial semantics you can do constant propagation on x below, but under parallel semantics you cannot. All of these issues have been considered.

    int x = 100;

    parallel num_threads(4)
    {
      ....
      atomic {
        x = x + 600
      }
    }

These issues already exist when you do IPO across OpenCL or CUDA kernel functions, or across functions outlined by ClangFE.

>>>>>Another issue is how much are these intrinsics acting as “barrier” for regular optimizations? For example what prevents reordering a loop such that it is executed *before* the intrinsic that mark the beginning of the region?

ClangFE will need to set the "convergent" attribute on the intrinsic calls (at the call site) based on the language construct semantics.

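The constant-propagation example above can be worked through numerically. The following is only a sketch under stated assumptions: the function names are hypothetical, the elided "...." part of the region is ignored, each of the four threads is assumed to execute the atomic update exactly once, and the threads are modelled by a sequential loop rather than by real OpenMP threads (atomicity makes every interleaving produce the same final value).

```cpp
#include <cassert>

// Serial reading: the region body runs once, so a compiler may fold the
// known value x = 100 into the update and conclude x == 700 afterwards.
int serial_result(int initial, int increment) {
    int x = initial;
    x = x + increment;
    return x;
}

// Parallel reading, modelled sequentially: each logical thread performs the
// atomic update once and observes the value left by the previous update.
int parallel_result(int initial, int increment, int num_threads) {
    int x = initial;
    for (int t = 0; t < num_threads; ++t) {
        x = x + increment;  // stands in for: atomic { x = x + 600 }
    }
    return x;
}
```

With the values from the example, `serial_result(100, 600)` gives 700 while `parallel_result(100, 600, 4)` gives 2500, so replacing the read of x inside the region with the constant 100 would change the result.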
Thanks,
Xinmin

-----Original Message-----
From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Mehdi Amini via llvm-dev
Sent: Thursday, January 12, 2017 11:07 PM
To: Hal Finkel <hfinkel at anl.gov>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

> On Jan 11, 2017, at 2:02 PM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> A Proposal for adding an experimental IR-level region-annotation
> infrastructure
> ================================================================
>
> Hal Finkel (ANL) and Xinmin Tian (Intel)
>
> This is a proposal for adding an experimental infrastructure to
> support annotating regions in LLVM IR, making use of intrinsics and
> metadata, and a generic analysis to allow transformations to easily
> make use of these annotated regions. This infrastructure is flexible
> enough to support representation of directives for parallelization,
> vectorization, and offloading of both loops and more-general code
> regions. Under this scheme, the conceptual distance between
> source-level directives and the region annotations need not be
> significant, making the incremental cost of supporting new directives
> and modifiers often small. It is not, however, specific to those use
> cases.
>
> Problem Statement
> =================
>
> There are a series of discussions on LLVM IR extensions for
> representing region and loop annotations for parallelism, and other
> user-guided transformations, among both industrial and academic
> members of the LLVM community. Increasing the quality of our OpenMP
> implementation is an important motivating use case, but certainly not
> the only one. For OpenMP in particular, we've discussed having an IR
> representation for years. Presently, all OpenMP pragmas are
> transformed directly into runtime-library calls in Clang, and
> outlining (i.e. extracting parallel regions into their own functions
> to be invoked by the runtime library) is done in Clang as well. Our
> implementation does not further optimize OpenMP constructs, and a lot
> of thought has been put into how we might improve this. For some
> optimizations, such as redundant barrier removal, we could use a
> TargetLibraryInfo-like mechanism to recognize frontend-generated
> runtime calls and proceed from there. Dealing with cases where we lose
> pointer-aliasing information, information on loop bounds, etc. we
> could improve by improving our inter-procedural-analysis capabilities.
> We should do that regardless. However, there are important cases where
> the underlying scheme we want to use to lower the various parallelism
> constructs, especially when targeting accelerators, changes depending
> on what is in the parallel region.
>
> In important cases where we can see everything (i.e. there aren't
> arbitrary external calls), code generation should proceed in a way
> that is very different from the general case. To have a sensible
> implementation, this must be done after inlining. When using LTO, this
> should be done during the link-time phase. As a result, we must move
> away from our purely-front-end based lowering scheme. The question is
> what to do instead, and how to do it in a way that is generally useful
> to the entire community.
>
> Designs previously discussed can be classified into four categories:
>
> (a) Add a large number of new kinds of LLVM metadata, and use them to
>     annotate each necessary instruction for parallelism, data
>     attributes, etc.
> (b) Add several new LLVM instructions such as, for parallelism, fork,
>     spawn, join, barrier, etc.
> (c) Add a large number of LLVM intrinsics for directives and clauses,
>     each intrinsic representing a directive or a clause.
> (d) Add a small number of LLVM intrinsics for region or loop
>     annotations, represent the directive/clause names using metadata
>     and the remaining information using arguments.
>
> Here we're proposing (d), and below is a brief pros and cons analysis
> based on these discussions and our own experiences of supporting
> region/loop annotations in LLVM-based compilers. The table below shows
> a short summary of our analysis.
>
> Various commercial compilers (e.g. from Intel, IBM, Cray, PGI), and
> GCC [1,2], have IR-level representations for parallelism constructs.
> Based on experience from these previous developments, we'd like a
> solution for LLVM that maximizes optimization enablement while
> minimizing the maintenance costs and complexity increase experienced
> by the community as a whole.
>
> Representing the desired information in the LLVM IR is just the first
> step. The challenge is to maintain the desired semantics without
> blocking useful optimizations. With options (c) and (d), dependencies
> can be preserved mainly based on the use/def chain of the arguments of
> each intrinsic, and a manageable set of LLVM analyses and
> transformations can be made aware of certain kinds of annotations in
> order to enable specific optimizations. In this regard, options (c)
> and (d) are close with respect to maintenance efforts. However, based
> on our experiences, option (d) is preferable because it is easier to
> extend to support new directives and clauses in the future without the
> need to add new intrinsics as required by option (c).
>
> Table 1. Pros/cons summary of LLVM IR experimental extension options
>
> --------+----------------------+-----------------------------------------------
> Options | Pros                 | Cons
> --------+----------------------+-----------------------------------------------
> (a)     | No need to add new   | LLVM passes do not always maintain metadata.
>         | instructions or      | Need to educate many passes (if not all) to
>         | new intrinsics       | understand and handle them.
> --------+----------------------+-----------------------------------------------
> (b)     | Parallelism becomes  | Huge effort for extending all LLVM passes and
>         | first class citizen  | code generation to support new instructions.
>         |                      | A large set of information still needs to be
>         |                      | represented using other means.
> --------+----------------------+-----------------------------------------------
> (c)     | Less impact on the   | A large number of intrinsics must be added.
>         | existing LLVM passes.| Some of the optimizations need to be
>         | Fewer requirements   | educated to understand them.
>         | for passes to        |
>         | maintain metadata.   |
> --------+----------------------+-----------------------------------------------
> (d)     | Minimal impact on    | Some of the optimizations need to be
>         | existing LLVM        | educated to understand them.
>         | optimization passes. | No requirements for all passes to maintain
>         | Directive and clause | large set of metadata with values.
>         | names use metadata   |
>         | strings.             |
> --------+----------------------+-----------------------------------------------
>
> Regarding (a), LLVM already uses metadata for certain loop information
> (e.g. annotations directing loop transformations and assertions about
> loop-carried dependencies), but there is no natural or consistent way
> to extend this scheme to represent necessary data-movement or region
> information.
>
> New Intrinsics for Region and Value Annotations
> ===============================================
>
> The following new (experimental) intrinsics are proposed which allow:
>
> a) Annotating a code region marked with directives / pragmas,
> b) Annotating values associated with the region (or loops), that is,
>    those values associated with directives / pragmas.
> c) Providing information on LLVM IR transformations needed for the
>    annotated code regions (or loops).
>
> These can be used both by frontends and also by transformation passes
> (e.g. automated parallelization). The names used here are similar to
> those used by our internal prototype, but obviously we expect a
> community bikeshed discussion.
>
>   def int_experimental_directive : Intrinsic<[], [llvm_metadata_ty],
>                                    [IntrArgMemOnly],
>                                    "llvm.experimental.directive">;
>
>   def int_experimental_dir_qual : Intrinsic<[], [llvm_metadata_ty],
>                                   [IntrArgMemOnly],
>                                   "llvm.experimental.dir.qual">;
>
>   def int_experimental_dir_qual_opnd : Intrinsic<[],
>                                        [llvm_metadata_ty, llvm_any_ty],
>                                        [IntrArgMemOnly],
>                                        "llvm.experimental.dir.qual.opnd">;
>
>   def int_experimental_dir_qual_opndlist : Intrinsic<
>                                        [],
>                                        [llvm_metadata_ty, llvm_vararg_ty],
>                                        [IntrArgMemOnly],
>                                        "llvm.experimental.dir.qual.opndlist">;
>
> Note that calls to these intrinsics might need to be annotated with
> the convergent attribute when they represent fork/join operations,
> barriers, and similar.
>
> Usage Examples
> ==============
>
> This section shows a few examples using these experimental intrinsics.
> LLVM developers who will use these intrinsics can define their own
> MDString. All details of using these intrinsics on representing
> OpenMP 4.5 constructs are described in [1][3].
>
> Example I: An OpenMP combined construct
>
>   #pragma omp target teams distribute parallel for simd
>   loop
>
> LLVM IR
> -------
>   call void @llvm.experimental.directive(metadata !0)
>   call void @llvm.experimental.directive(metadata !1)
>   call void @llvm.experimental.directive(metadata !2)
>   call void @llvm.experimental.directive(metadata !3)
>   loop
>   call void @llvm.experimental.directive(metadata !6)
>   call void @llvm.experimental.directive(metadata !5)
>   call void @llvm.experimental.directive(metadata !4)
>
>   !0 = metadata !{metadata !DIR.OMP.TARGET}
>   !1 = metadata !{metadata !DIR.OMP.TEAMS}
>   !2 = metadata !{metadata !DIR.OMP.DISTRIBUTE.PARLOOP.SIMD}
>
>   !6 = metadata !{metadata !DIR.OMP.END.DISTRIBUTE.PARLOOP.SIMD}
>   !5 = metadata !{metadata !DIR.OMP.END.TEAMS}
>   !4 = metadata !{metadata !DIR.OMP.END.TARGET}

Something isn’t clear to me about how do you preserve the validity of the region annotations since regular passes don’t know about the attached semantic?

For example, if a region is marking a loop as parallel from an OpenMP pragma, but a strength reduction transformation introduces a loop-carried dependency and thus invalidate the “parallel” semantic?

Another issue is how much are these intrinsics acting as “barrier” for regular optimizations? For example what prevents reordering a loop such that it is executed *before* the intrinsic that mark the beginning of the region?

I feel I missed a piece (but maybe I should start with the provided references?) :)

--
Mehdi

>
> Example II: Assume x,y,z are int variables, and s is a non-POD variable.
> Then, lastprivate(x,y,s,z) is represented as:
>
> LLVM IR
> -------
>   call void @llvm.experimental.dir.qual.opndlist(
>       metadata !1, %x, %y, metadata !2, %a, %ctor, %dtor, %z)
>
>   !1 = metadata !{metadata !QUAL.OMP.PRIVATE}
>   !2 = metadata !{metadata !QUAL.OPND.NONPOD}
>
> Example III: A prefetch pragma example
>
>   // issue vprefetch1 for xp with a distance of 20 vectorized iterations ahead
>   // issue vprefetch0 for yp with a distance of 10 vectorized iterations ahead
>   #pragma prefetch x:1:20 y:0:10
>   for (i=0; i<2*N; i++) { xp[i*m + j] = -1; yp[i*n +j] = -2; }
>
> LLVM IR
> -------
>   call void @llvm.experimental.directive(metadata !0)
>   call void @llvm.experimental.dir.qual.opndlist(metadata !1, %xp, 1, 20,
>                                                  metadata !1, %yp, 0, 10)
>   loop
>   call void @llvm.experimental.directive(metadata !3)
>
> References
> ==========
>
> [1] LLVM Framework and IR extensions for Parallelization, SIMD
>     Vectorization and Offloading Support. SC'2016 LLVM-HPC3 Workshop.
>     (Xinmin Tian et al.) Salt Lake City, Utah.
>
> [2] Extending LoopVectorizer towards supporting OpenMP4.5 SIMD and outer
>     loop auto-vectorization. (Hideki Saito et al.) LLVM Developers'
>     Meeting 2016, San Jose.
>
> [3] Intrinsics, Metadata, and Attributes: The Story continues! (Hal
>     Finkel) LLVM Developers' Meeting 2016, San Jose.
>
> [4] LLVM Intrinsic Function and Metadata String Interface for Directive
>     (or Pragmas) Representation. Specification Draft v0.9, Intel
>     Corporation, 2016.
>
> Acknowledgements
> ================
>
> We would like to thank Chandler Carruth (Google), Johannes Doerfert
> (Saarland Univ.), Yaoqing Gao (HuaWei), Michael Wong (Codeplay), Ettore
> Tiotto, Carlo Bertolli, Bardia Mahjour (IBM), and all other LLVM-HPC IR
> Extensions WG members for their constructive feedback on the LLVM
> framework and IR extension proposal.
>
> Proposed Implementation
> =======================
>
> Two sets of patches supporting these experimental intrinsics and
> demonstrating their usage are ready for community review.
>
> a) Clang patches that support core OpenMP pragmas using this approach.
> b) W-Region framework patches: CFG restructuring to form
>    single-entry-single-exit work regions (W-Region) based on
>    annotations, demand-driven intrinsic parsing, WRegionInfo collection
>    and analysis passes, and dump functions for WRegionInfo.
>
> On top of this functionality, we will provide the transformation patches
> for core OpenMP constructs (e.g. start with "#pragma omp parallel for"
> loop for lowering and outlining, and "#pragma omp simd" to hook it up
> with LoopVectorize.cpp). We have internal implementations for many
> constructs now. We will break this functionality up to create a series
> of patches for community review.
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
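Example I in the quoted RFC brackets the loop with begin and end directive markers, and the proposed W-Region patches are said to form single-entry-single-exit regions from such annotations. As a rough illustration only, the sketch below checks that a sequence of directive names is properly nested. The helper names are hypothetical, the convention that an end marker is the begin marker with "END." inserted after "DIR.OMP." is inferred from the three pairs defined in Example I, and none of this is taken from the actual W-Region implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: maps an end marker such as "DIR.OMP.END.TEAMS" to the
// begin marker it closes ("DIR.OMP.TEAMS"); returns "" for any other marker.
std::string matching_begin(const std::string& dir) {
    const std::string prefix = "DIR.OMP.END.";
    if (dir.compare(0, prefix.size(), prefix) != 0) return "";
    return "DIR.OMP." + dir.substr(prefix.size());
}

// Returns true when every end marker closes the most recently opened region
// and no region is left open at the end of the sequence.
bool well_nested(const std::vector<std::string>& markers) {
    std::vector<std::string> open;  // stack of currently open regions
    for (const std::string& m : markers) {
        const std::string begin = matching_begin(m);
        if (begin.empty()) {
            open.push_back(m);      // a begin marker opens a region
        } else {
            if (open.empty() || open.back() != begin) return false;
            open.pop_back();        // the end marker closes the innermost region
        }
    }
    return open.empty();
}
```

Feeding it the six directive names that Example I defines (TARGET, TEAMS, DISTRIBUTE.PARLOOP.SIMD and their END counterparts in reverse order) returns true, while a sequence that closes TARGET before TEAMS returns false.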
Mehdi Amini via llvm-dev
2017-Jan-13 17:16 UTC
[llvm-dev] [RFC] IR-level Region Annotations
Hi,

> On Jan 13, 2017, at 9:00 AM, Tian, Xinmin <xinmin.tian at intel.com> wrote:
>
> Mehdi, thanks for good questions.
>
>>>>>> Something isn’t clear to me about how do you preserve the validity of the region annotations since regular passes don’t know about the attached semantic?
>
> There are some small changes we have to make in some optimizations to make sure they do not invalidate the attached annotation semantics:

I fear that this does not seem to play well with the original claim of the RFC about a “minimal impact" on existing passes.

Especially since Hal mentioned “the motivation here is to support frontends inserting custom region annotations”, it is not clear if we wouldn’t have to teach passes to treat the intrinsics as optimization barriers by default (which kind of defeats the whole point of this), and then teach passes about the semantics of each kind of region. It may be possible to abstract some properties about regions, à la TTI, with hooks that the passes would query. But that seems like something that’d need a lot of scrutiny before being able to evaluate the viability of the design.

> 1) provide hand-shaking / query utilities so that an optimization knows the region is a parallel loop, and 2) set up a proper optimization phase ordering. In our product compiler ICC, we used both approaches.
>
>>>>>> For example, if a region is marking a loop as parallel from an OpenMP pragma, but a strength reduction transformation introduces a loop-carried dependency and thus invalidate the “parallel” semantic?
>
> Yes, there is a list of such cases, e.g. forward substitution, strength reduction, global constant propagation. Here is another example: under serial semantics you can do constant propagation on x below, but under parallel semantics you cannot. All of these issues have been considered.
>
>     int x = 100;
>
>     parallel num_threads(4)
>     {
>       ....
>       atomic {
>         x = x + 600
>       }
>     }
>
> These issues already exist when you do IPO across OpenCL or CUDA kernel functions, or across functions outlined by ClangFE.

Right, but fortunately there are only a few passes to teach about IPO, and we already have a generic mechanism to inhibit IPO, which is not the case with peephole or other function passes.

>>>>>> Another issue is how much are these intrinsics acting as “barrier” for regular optimizations? For example what prevents reordering a loop such that it is executed *before* the intrinsic that mark the beginning of the region?
>
> ClangFE will need to set the "convergent" attribute on the intrinsic calls (at the call site) based on the language construct semantics.

Convergent does not prevent reordering AFAIK:

    convergent call llvm.region.begin("parallel.omp.for")
    for (I : 0->N)
      a[I] = b[I] + c[I];
    convergent call llvm.region.end("parallel.omp.for")

Can become:

    for (I : 0->N)
      a[I] = b[I] + c[I];
    convergent call llvm.region.begin("parallel.omp.for")
    convergent call llvm.region.end("parallel.omp.for")

--
Mehdi
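The idea raised in this message, hooks that passes query for region properties with a conservative default for annotations they do not understand, could look roughly like the sketch below. Everything in it is an assumption made for illustration: the enum, both function names, and the use of the "parallel.omp.for" string from the example above as the one recognized directive. It is not an existing LLVM interface.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification of the region enclosing a loop.
enum class RegionKind { None, ParallelLoop, Unknown };

// Hypothetical query utility: classify a region by its directive string.
// An empty string means the loop is not inside any annotated region.
RegionKind classify_region(const std::string& directive) {
    if (directive.empty()) return RegionKind::None;
    if (directive == "parallel.omp.for") return RegionKind::ParallelLoop;
    return RegionKind::Unknown;  // custom annotation this pass does not know
}

// Guard that a transformation such as strength reduction could consult
// before introducing a loop-carried dependence.
bool may_introduce_loop_carried_dep(RegionKind kind) {
    switch (kind) {
        case RegionKind::None:         return true;   // plain serial loop
        case RegionKind::ParallelLoop: return false;  // would break parallel semantics
        case RegionKind::Unknown:      return false;  // conservative default
    }
    return false;
}
```

The design question in the thread is visible in the last case: if unknown regions must default to "no", every custom frontend annotation acts as an optimization barrier until passes are taught about it.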
Tian, Xinmin via llvm-dev
2017-Jan-13 17:45 UTC
[llvm-dev] [RFC] IR-level Region Annotations
The "minimal" is the most desired goal. But <1000 LOC of changes is better than 5000 LOC of changes, right?

For the RFC, the intrinsic comes with use-def of a, b, c and memory read/write. So far, the re-ordering you showed does not happen at -O2 or -O3, without any changes in other passes. I am not saying we won't need to make any changes in the future.

    convergent call llvm.region.begin("parallel.omp.for") (a, b, c)  // either use argument or Taken with tags.
    for (I : 0->N)
      a[I] = b[I] + c[I];
    convergent call llvm.region.end("parallel.omp.for")

A question back to you:

    X[i] = ...
    convergent call @llvm.Barrier(..)

Did you see the re-ordering happen, moving the barrier before X[i] = ...?

>>>>Right but fortunately there are only a few passes to teach about IPO, and we already have generic mechanism to inhibit IPO, which is not the case with peephole or other function passes.

Agreed, we will need changes in peephole or function passes. But it is reasonably manageable. In our product compiler, it is ~500 LOC overall.

-----Original Message-----
From: mehdi.amini at apple.com [mailto:mehdi.amini at apple.com]
Sent: Friday, January 13, 2017 9:17 AM
To: Tian, Xinmin <xinmin.tian at intel.com>
Cc: Hal Finkel <hfinkel at anl.gov>; llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

Hi,

> On Jan 13, 2017, at 9:00 AM, Tian, Xinmin <xinmin.tian at intel.com> wrote:
>
> Mehdi, thanks for good questions.
>
>>>>>> Something isn’t clear to me about how do you preserve the validity of the region annotations since regular passes don’t know about the attached semantic?
>
> There are some small changes we have to make in some optimizations to make sure the optimizations do not invalidate the attached annotation semantics.

I fear that this does not seem to play well with the original claim of the RFC about a "minimal impact" on existing passes.
Especially since Hal mentioned “the motivation here is to support frontends inserting custom region annotations”, it is not clear if we wouldn’t have to teach passes to treat the intrinsics as optimization barriers by default (which kind of defeats the whole point of this), and then teach passes about the semantics of each kind of region. It may be possible to abstract some properties about regions, à la TTI, with hooks that the passes would query. But that seems like something that’d need a lot of scrutiny before being able to evaluate the viability of the design.

> 1) provide hand-shaking / query utils for optimization to know the region is parallel loop, 2) setup a proper optimization phase ordering. In our product compiler ICC, we used both approaches.
>
>>>>>> For example, if a region is marking a loop as parallel from an OpenMP pragma, but a strength reduction transformation introduces a loop-carried dependency and thus invalidate the “parallel” semantic?
>
> Yes, there is a list of such cases, e.g. forward substitution, strength reduction, global constant propagation. Here is another example: under serial semantics, you can do constant propagation, but under parallel semantics, we can't do constant propagation. All these issues are considered.
>
> Int x = 100;
>
> parallel num_threads(4)
> {
>   ....
>   atomic {
>     x = x + 600
>   }
> }
>
> These issues exist already when you do IPO optimization across OpenCL or Cuda kernel functions, or outlined functions from ClangFE.

Right but fortunately there are only a few passes to teach about IPO, and we already have generic mechanism to inhibit IPO, which is not the case with peephole or other function passes.

>
>>>>>> Another issue is how much are these intrinsics acting as “barrier” for regular optimizations? For example what prevents reordering a loop such that it is executed *before* the intrinsic that mark the beginning of the region?
>
> ClangFE will need to set the "convergent" attribute for the intrinsic calls (call side) based on the language construct semantics.

Convergent does not prevent reordering AFAIK:

    convergent call llvm.region.begin("parallel.omp.for")
    for (I : 0->N)
      a[I] = b[I] + c[I];
    convergent call llvm.region.end("parallel.omp.for")

Can become:

    for (I : 0->N)
      a[I] = b[I] + c[I];
    convergent call llvm.region.begin("parallel.omp.for")
    convergent call llvm.region.end("parallel.omp.for")

--
Mehdi

>
> Thanks,
> Xinmin
>
>
> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Mehdi Amini via llvm-dev
> Sent: Thursday, January 12, 2017 11:07 PM
> To: Hal Finkel <hfinkel at anl.gov>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>
>
>> On Jan 11, 2017, at 2:02 PM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> A Proposal for adding an experimental IR-level region-annotation
>> infrastructure
>> ================================================================
>> Hal Finkel (ANL) and Xinmin Tian (Intel)
>>
>> This is a proposal for adding an experimental infrastructure to
>> support annotating regions in LLVM IR, making use of intrinsics and
>> metadata, and a generic analysis to allow transformations to easily
>> make use of these annotated regions. This infrastructure is flexible
>> enough to support representation of directives for parallelization,
>> vectorization, and offloading of both loops and more-general code
>> regions. Under this scheme, the conceptual distance between
>> source-level directives and the region annotations need not be
>> significant, making the incremental cost of supporting new directives
>> and modifiers often small. It is not, however, specific to those use cases.
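Editorial sketch: the `x = 100` example quoted in this message can be made concrete with a small model. The code below is Python, not compiler output, and it reflects one reading of the example: four threads each perform the atomic update, and a pass that assumed serial semantics has replaced the read of `x` inside the region with the constant 100. The function name `run_parallel` and the `constant_propagated` flag are hypothetical.

```python
import threading


def run_parallel(num_threads=4, constant_propagated=False):
    """Model `x = 100; parallel num_threads(4) { atomic { x = x + 600 } }`."""
    x = {"value": 100}
    lock = threading.Lock()  # stands in for the atomic construct

    def body():
        with lock:
            if constant_propagated:
                # A serial-minded pass folded the read of x to 100.
                x["value"] = 100 + 600
            else:
                # Correct parallel semantics: read the shared, updated x.
                x["value"] = x["value"] + 600

    threads = [threading.Thread(target=body) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x["value"]
```

Under this model the unoptimized region accumulates all four updates (100 + 4 * 600), while the constant-propagated version leaves `x` at 700 no matter how many threads ran. That gap is why a pass needs to know the region is parallel before forwarding the initial value into it.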
Daniel Berlin via llvm-dev
2017-Jan-13 18:01 UTC
[llvm-dev] [RFC] IR-level Region Annotations
On Fri, Jan 13, 2017 at 9:00 AM, Tian, Xinmin via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> Mehdi, thanks for good questions.
>
> >>>>>Something isn’t clear to me about how do you preserve the validity of
> the region annotations since regular passes don’t know about the attached
> semantic?
>
> There are some small changes we have to make in some optimizations to make
> sure the optimizations do not invalidate the attached annotation semantics.
> 1) provide hand-shaking / query utils for optimization to know the region
> is parallel loop, 2) setup a proper optimization phase ordering. In our
> product compiler ICC, we used both approaches.

But this is very different than what you said earlier, because it's not minimal impact.

Also, what you've proposed are very generic annotations, and what you are talking about here is a very specific set of ones, and their effects.

If you are assuming these intrinsics will only be used to implement a specific set of annotations, with specific semantics, I'm probably with Reid on the "please use specific constructs" bandwagon.

Otherwise, the regions could be and do anything, and the handshaking/querying has to be done everywhere, for everything, because you don't actually know what they are implementing.

Maybe my region annotation is for "don't PRE things when they have exactly 73 predecessors" regions.
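Editorial sketch: the "handshaking/querying" concern can be illustrated with a toy query hook. Everything here is hypothetical: the `KNOWN_REGIONS` table, its entries, and the `may_transform` function are placeholders written in Python, not an LLVM interface. The only names borrowed from the thread are the `parallel.omp.for` tag and the transformations mentioned earlier as problematic.

```python
# Hypothetical table: which transformations a *known* region kind tolerates.
# The True/False values are placeholders, not statements about real passes.
KNOWN_REGIONS = {
    "parallel.omp.for": {
        "constant_propagation": False,
        "strength_reduction": False,
        "forward_substitution": False,
        "dead_code_elimination": True,
    },
}


def may_transform(region_kind, transform):
    """Return True only if the region kind is known to allow `transform`."""
    props = KNOWN_REGIONS.get(region_kind)
    if props is None:
        # Unknown annotation: its semantics could be anything, so the
        # only safe answer is to treat it as an optimization barrier.
        return False
    # Known region, unlisted transform: stay conservative as well.
    return props.get(transform, False)
```

With fully generic annotations, every pass would have to call something like `may_transform` before touching code near a region, and any tag missing from the table (for instance a custom "no PRE with 73 predecessors" region) falls back to the barrier behaviour. That is the trade-off between generic and specific constructs being debated here.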
Tian, Xinmin via llvm-dev
2017-Jan-13 18:25 UTC
[llvm-dev] [RFC] IR-level Region Annotations
>>>>If you are assuming these intrinsics will only be used to implement a specific set of annotations, with specific semantics, i'm probably with Reid on the "please use specific constructs" bandwagon.

I wouldn’t disagree on this part if these intrinsics end up being used for a specific set of annotations.

From: Daniel Berlin [mailto:dberlin at dberlin.org]
Sent: Friday, January 13, 2017 10:01 AM
To: Tian, Xinmin <xinmin.tian at intel.com>
Cc: Mehdi Amini <mehdi.amini at apple.com>; Hal Finkel <hfinkel at anl.gov>; llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

On Fri, Jan 13, 2017 at 9:00 AM, Tian, Xinmin via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> Mehdi, thanks for good questions.
>
> >>>>>Something isn’t clear to me about how do you preserve the validity of the region annotations since regular passes don’t know about the attached semantic?
>
> There are some small changes we have to make in some optimizations to make sure the optimizations do not invalidate the attached annotation semantics. 1) provide hand-shaking / query utils for optimization to know the region is parallel loop, 2) setup a proper optimization phase ordering. In our product compiler ICC, we used both approaches.

But this is very different than what you said earlier, because it's not minimal impact.

Also, what you've proposed are very generic annotations, and what you are talking about here is a very specific set of ones, and their effects.

If you are assuming these intrinsics will only be used to implement a specific set of annotations, with specific semantics, I'm probably with Reid on the "please use specific constructs" bandwagon.

Otherwise, the regions could be and do anything, and the handshaking/querying has to be done everywhere, for everything, because you don't actually know what they are implementing.

Maybe my region annotation is for "don't PRE things when they have exactly 73 predecessors" regions.