Artem Belevich via llvm-dev
2021-Sep-07 16:36 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
On Tue, Sep 7, 2021 at 9:15 AM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:

> +bump
>
> Jon did respond positively to the proposal. I think the table
> implementation vs. the "implemented_by" implementation is something we
> can experiment with. I'm in favor of the latter as it is more general
> and can be used in other places more easily, e.g., by providing source
> annotations. That said, having the table version first would be a big
> step forward too.
>
> I'd say, if we hear some other positive voices towards this we go ahead
> with patches on Phab. After an end-to-end series is approved we merge it
> together.
>
> That said, people should chime in if they (dis)like the approach to get
> math optimizations (and similar things) working on the GPU.

I do like this approach for CUDA and NVPTX. I think HIP/AMDGPU may benefit
from it, too (+cc: yaxun.liu@).

This will likely also be useful for things other than math functions. E.g.
it may come in handy for sanitizer runtimes (+cc: eugenis@) that currently
rely on LLVM *not* materializing libcalls they can't provide when they are
building the runtime itself.

--Artem

> ~ Johannes
>
> On 4/29/21 6:25 PM, Jon Chesterfield via llvm-dev wrote:
> >> Date: Wed, 28 Apr 2021 18:56:32 -0400
> >> From: William Moses via llvm-dev <llvm-dev at lists.llvm.org>
> >> To: Artem Belevich <tra at google.com>
> >> ...
> >>
> >> Hi all,
> >>
> >> Reviving this thread as Johannes and I recently had some time to take
> >> a look and do some additional design work. We'd love any thoughts on
> >> the following proposal.
> >>
> > Keenly interested in this. Simplification (subjective) of the metadata
> > proposal at the end. Some extra background info first, though, as GPU
> > libm is a really interesting design space. When I did the bring-up for
> > a different architecture ~3 years ago, IIRC I found the complete set:
> > - clang lowering libm (named functions) to intrinsics
> > - clang lowering intrinsics to libm functions
> > - optimisation passes that transform libm and ignore intrinsics
> > - optimisation passes that transform intrinsics and ignore libm
> > - SelectionDAG represents some intrinsics as nodes
> > - strength reduction, e.g. cos(double) -> cosf(float) under fast-math
> >
> > I then wrote some more IR passes related to OpenCL-style vectorisation
> > and some combines to fill in the gaps (which have not reached
> > upstream). So my knowledge here is out of date, but clang/llvm wasn't
> > a totally consistent lowering framework back then.
> >
> > CUDA ships an IR library containing functions similar to libm. ROCm
> > does something similar, also IR. We do an impedance-matching scheme in
> > inline headers, which blocks various optimisations and poses some
> > challenges for Fortran.
> >
> > *Background:*
> >
> >> ...
> >> While in theory we could define the lowering of these intrinsics to
> >> be a table which looks up the correct __nv_sqrt, this would require
> >> the definition of all such functions to remain or otherwise be
> >> available. As it's undesirable for the LLVM backend to be aware of
> >> CUDA paths, etc., this means that the original definitions brought in
> >> by merging libdevice.bc must be maintained. Currently these are
> >> deleted if they are unused (as libdevice has them marked as internal).
> >>
> > The deleting is its own hazard in the context of fast-math, as the
> > function can be deleted, and then later an optimisation creates a
> > reference to it, which doesn't link. It also prevents the backend from
> > (safely) assuming the functions are available, which is moderately
> > annoying for lowering some SDag ISD nodes.
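For concreteness, the header-based impedance matching mentioned above looks
roughly like the sketch below. This is a simplified illustration in the
spirit of clang's CUDA wrapper headers (e.g. __clang_cuda_math.h), not their
verbatim contents; it shows why the libm-to-__nv_ mapping happens in source,
before the optimizer runs:

  // Simplified sketch of the header-based mapping (illustrative only, not
  // the actual contents of __clang_cuda_math.h).
  extern "C" __device__ double __nv_sin(double);  // defined in libdevice.bc
  extern "C" __device__ float  __nv_sinf(float);

  __device__ inline double sin(double x)  { return __nv_sin(x); }
  __device__ inline float  sinf(float x)  { return __nv_sinf(x); }

  // Because the mapping is done in source, a call to llvm.sin.f64 that an
  // optimisation pass introduces later has no __nv_sin definition left to
  // bind to once the unused, internal libdevice functions are discarded.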
> >
> >> 2) GPU math functions aren't able to be optimized, unlike standard
> >> math functions.
> >>
> > This one is bad.
> >
> > *Design Constraints:*
> >
> >> To remedy the problems described above we need a design that meets
> >> the following:
> >> * Does not require modifying libdevice.bc or other code shipped by a
> >> vendor-specific installation
> >> * Allows llvm math intrinsics to be lowered to device-specific code
> >> * Keeps definitions of code used to implement intrinsics until after
> >> all potentially relevant intrinsics (including those created by LLVM
> >> passes) have been lowered.
> >>
> > Yep, constraints sound right. Back ends can emit calls to these
> > functions too, but I think nvptx/amdgcn do not. Perhaps they would
> > like to be able to in places.
> >
> > *Initial Design:*
> >
> >> ... metadata / aliases ...
> >>
> > The design would work, and lets us continue with the header files we
> > have now. It avoids some tedious programming, i.e. if we approached
> > this as the usual back-end lowering, where intrinsics / ISD nodes are
> > emitted as named function calls. That can be mostly driven by a table
> > lookup as the function arity is limited. It is (i.e. was) quite
> > tedious programming that in ISel. Doing basically the same thing for
> > SDag + GIsel / ptx + gcn, with associated tests, is also unappealing.
> >
> > The set of functions near libm is small and known. We would need to
> > mark 'sin' as 'implemented by' slightly different functions for nvptx
> > and amdgcn, and some of them need thin wrapper code (e.g. modf in
> > amdgcn takes an argument by pointer). It would be helpful for the
> > Fortran runtime libraries effort if the implementation didn't use
> > inline code in headers.
> >
> > There's very close to a 1:1 mapping between the two GPU libraries;
> > even some extensions to libm exist in both. Therefore we could write a
> > table,
> >   {llvm.sin.f64, "sin", __nv_sin, __ocml_sin},
> > with NULL or similar for functions that aren't available.
> >
> > A function-level IR pass, called late in the pipeline, crawls the call
> > instructions and rewrites based on simple rules and that table. That
> > is, it would rewrite a call to llvm.sin.f64 to a call to __ocml_sin.
> > Exactly the same net effect as a header file containing metadata
> > annotations, except we don't need the metadata machinery and we can
> > use a single trivial IR pass for N architectures (by adding a column).
> > The pass can do the odd ugly thing like impedance-match a function
> > type easily enough.
> >
> > The other side of the problem - that functions once introduced have to
> > hang around until we are sure they aren't needed - is the same as in
> > your proposal. My preference would be to introduce the libdevice
> > functions immediately after the lowering pass above, but we can inject
> > them early and tag them to avoid erasure instead. We kind of need that
> > to handle the cos->cosf transform anyway.
> >
> > Quite similar to the 'in theory ... table' suggestion, which I like
> > because I remember it being far simpler than the SDag rewrite rules.
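For illustration, a minimal sketch of such a table-driven rewriting pass
could look like the following. The pass name, the table contents, and the
column selection are hypothetical (taken from the table row quoted above,
not from actual upstream code), and only the f64 overloads are handled:

  #include "llvm/IR/InstIterator.h"
  #include "llvm/IR/Intrinsics.h"
  #include "llvm/IR/IntrinsicInst.h"
  #include "llvm/IR/Module.h"
  #include "llvm/IR/PassManager.h"

  using namespace llvm;

  namespace {

  // One row per libm-like function, one column per vendor library. Names
  // follow the mail above and may not match the real entry points exactly.
  struct MathMapping {
    Intrinsic::ID ID;     // e.g. Intrinsic::sin for llvm.sin.*
    const char *LibmName; // "sin"
    const char *NVName;   // "__nv_sin", nullptr if libdevice lacks it
    const char *OCMLName; // "__ocml_sin", nullptr if OCML lacks it
  };

  static const MathMapping Table[] = {
      {Intrinsic::sin, "sin", "__nv_sin", "__ocml_sin"},
      {Intrinsic::cos, "cos", "__nv_cos", "__ocml_cos"},
      // ...
  };

  struct GPUMathLoweringPass : PassInfoMixin<GPUMathLoweringPass> {
    // In a real pass this would be derived from the target triple
    // (nvptx* -> libdevice column, amdgcn -> OCML column).
    bool UseOCML = false;

    PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
      Module &M = *F.getParent();
      bool Changed = false;
      for (Instruction &I : instructions(F)) {
        auto *II = dyn_cast<IntrinsicInst>(&I);
        // Only the f64 case is handled here; a real pass would also map
        // the f32 (and possibly vector) overloads to the *f variants.
        if (!II || !II->getType()->isDoubleTy())
          continue;
        for (const MathMapping &Row : Table) {
          if (II->getIntrinsicID() != Row.ID)
            continue;
          const char *Target = UseOCML ? Row.OCMLName : Row.NVName;
          if (!Target)
            break; // no device implementation; leave the intrinsic alone
          // The prototypes line up for plain libm-style functions, so the
          // call can simply be retargeted; functions that need wrappers
          // (e.g. modf on amdgcn) would need extra handling here.
          FunctionCallee Callee =
              M.getOrInsertFunction(Target, II->getFunctionType());
          II->setCalledFunction(Callee);
          Changed = true;
          break;
        }
      }
      return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
    }
  };

  } // end anonymous namespace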
> > Thanks!
> >
> > Jon

--
--Artem Belevich
Artem Belevich via llvm-dev
2021-Nov-17 19:20 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
bump.

On Tue, Sep 7, 2021 at 9:36 AM Artem Belevich <tra at google.com> wrote:

> On Tue, Sep 7, 2021 at 9:15 AM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
>
>> ...
>>
>> I'd say, if we hear some other positive voices towards this we go ahead
>> with patches on Phab. After an end-to-end series is approved we merge
>> it together.

I think we've got as much interest expressed (or not) as we can reasonably
expect for something that most back-ends do not care about.

I vote for moving forward with the patches.

--Artem

--
--Artem Belevich