Artem Belevich via llvm-dev
2021-Nov-17 20:05 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
On Wed, Nov 17, 2021 at 11:40 AM Jon Chesterfield < jonathanchesterfield at gmail.com> wrote:> Thanks for the ping. > > The IR pass that rewrote llvm.libm intrinsics to architecture specific > ones I wrote years ago was pretty trivial. I'm up for re-implementing that. > > Essentially type out a (hash)table with entries like {llvm.sin.f64, "sin", > __nv_sin, __ocml_sin} and do the substitution as a pass called > 'ExpandLibmIntrinsics' or similar, run somewhere before instruction > selection for nvptx / amdgpu / other. > > Could factor it differently if we don't like having the nv/oc names next > to each other, pass could take the corresponding lookup table as an > argument. > > Main benefit over the implemented-in-terms-of metadata approach is it's > trivial to implement and dead simple. Lowering in IR means doing it once > instead of once in sdag and once in gisel. I'll write the pass (from > scratch, annoyingly, as the last version I wrote is still closed source) if > people seem in favour. >SGTM. Providing a fixed set of replacements for specific intrinsics is all NVPTX needs now. Expanding intrinsics late may miss some optimization opportunities, so we may consider doing it earlier and/or more than once, in case we happen to materialize new intrinsics in the later passes. --Artem> > Thanks all, > > Jon > > On Wed, Nov 17, 2021 at 7:20 PM Artem Belevich <tra at google.com> wrote: > >> bump. >> >> On Tue, Sep 7, 2021 at 9:36 AM Artem Belevich <tra at google.com> wrote: >> >>> >>> On Tue, Sep 7, 2021 at 9:15 AM Johannes Doerfert < >>> johannesdoerfert at gmail.com> wrote: >>> >>>> +bump >>>> >>>> Jon did respond positive to the proposal. I think the table >>>> implementation >>>> vs the "implemented_by" implementation is something we can experiment >>>> with. >>>> I'm in favor of the latter as it is more general and can be used in >>>> other >>>> places more easily, e.g., by providing source annotations. That said, >>>> having >>>> the table version first would be a big step forward too. >>>> >>>> I'd say, if we hear some other positive voices towards this we go ahead >>>> with >>>> patches on phab. After an end-to-end series is approved we merge it >>>> together. >>>> >>> >> I think we've got as much interest expressed (or not) as we can >> reasonably expect for something that most back-ends do not care about. >> I vote for moving forward with the patches. >> >> --Artem >> >> >> >>> >>>> That said, people should chime in if they (dis)like the approach to get >>>> math >>>> optimizations (and similar things) working on the GPU. >>>> >>> >>> I do like this approach for CUDA and NVPTX. I think HIP/AMDGPU may >>> benefit from it, too (+cc: yaxun.liu@). >>> >>> This will likely also be useful for things other than math functions. >>> E.g. it may come handy for sanitizer runtimes (+cc: eugenis@) that >>> currently rely on LLVM *not* materializing libcalls they can't provide when >>> they are building the runtime itself. >>> >>> --Artem >>> >>> >>>> >>>> ~ Johannes >>>> >>>> >>>> On 4/29/21 6:25 PM, Jon Chesterfield via llvm-dev wrote: >>>> >> Date: Wed, 28 Apr 2021 18:56:32 -0400 >>>> >> From: William Moses via llvm-dev <llvm-dev at lists.llvm.org> >>>> >> To: Artem Belevich <tra at google.com> >>>> >> ... >>>> >> >>>> >> Hi all, >>>> >> >>>> >> Reviving this thread as Johannes and I recently had some time to >>>> take a >>>> >> look and do some additional design work. We'd love any thoughts on >>>> the >>>> >> following proposal. >>>> >> >>>> > Keenly interested in this. Simplification (subjective) of the metadata >>>> > proposal at the end. Some extra background info first though as GPU >>>> libm is >>>> > a really interesting design space. When I did the bring up for a >>>> different >>>> > architecture ~3 years ago, iirc I found the complete set: >>>> > - clang lowering libm (named functions) to intrinsics >>>> > - clang lowering intrinsic to libm functions >>>> > - optimisation passes that transform libm and ignore intrinsics >>>> > - optimisation passes that transform intrinsics and ignore libm >>>> > - selectiondag represents some intrinsics as nodes >>>> > - strength reduction, e.g. cos(double) -> cosf(float) under fast-math >>>> > >>>> > I then wrote some more IR passes related to opencl-style >>>> vectorisation and >>>> > some combines to fill in the gaps (which have not reached upstream). >>>> So my >>>> > knowledge here is out of date but clang/llvm wasn't a totally >>>> consistent >>>> > lowering framework back then. >>>> > >>>> > Cuda ships an IR library containing functions similar to libm. ROCm >>>> does >>>> > something similar, also IR. We do an impedance matching scheme in >>>> inline >>>> > headers which blocks various optimisations and poses some challenges >>>> for >>>> > fortran. >>>> > >>>> > *Background:* >>>> > >>>> >> ... >>>> >> While in theory we could define the lowering of these intrinsics to >>>> be a >>>> >> table which looks up the correct __nv_sqrt, this would require the >>>> >> definition of all such functions to remain or otherwise be >>>> available. As >>>> >> it's undesirable for the LLVM backend to be aware of CUDA paths, >>>> etc, this >>>> >> means that the original definitions brought in by merging >>>> libdevice.bc must >>>> >> be maintained. Currently these are deleted if they are unused (as >>>> libdevice >>>> >> has them marked as internal). >>>> >> >>>> > The deleting is it's own hazard in the context of fast-math, as the >>>> > function can be deleted, and then later an optimisation creates a >>>> reference >>>> > to it, which doesn't link. It also prevents the backend from (safely) >>>> > assuming the functions are available, which is moderately annoying for >>>> > lowering some SDag ISD nodes. >>>> > >>>> > 2) GPU math functions aren't able to be optimized, unlike standard >>>> math >>>> > >>>> >> functions. >>>> >> >>>> > This one is bad. >>>> > >>>> > *Design Constraints:* >>>> >> To remedy the problems described above we need a design that meets >>>> the >>>> >> following: >>>> >> * Does not require modifying libdevice.bc or other code shipped by a >>>> >> vendor-specific installation >>>> >> * Allows llvm math intrinsics to be lowered to device-specific code >>>> >> * Keeps definitions of code used to implement intrinsics until after >>>> all >>>> >> potential relevant intrinsics (including those created by LLVM >>>> passes) have >>>> >> been lowered. >>>> >> >>>> > Yep, constraints sound right. Back ends can emit calls to these >>>> functions >>>> > too, but I think nvptx/amdgcn do not. Perhaps they would like to be >>>> able to >>>> > in places. >>>> > >>>> > *Initial Design:* >>>> > >>>> >> ... metadata / aliases ... >>>> >> >>>> > Design would work, lets us continue with the header files we have now. >>>> > Avoids some tedious programming, i.e. if we approached this as the >>>> usual >>>> > back end lowering, where intrinsics / isd nodes are emitted as named >>>> > function calls. That can be mostly driven by a table lookup as the >>>> function >>>> > arity is limited. It is (i.e. was) quite tedious programming that in >>>> ISel. >>>> > Doing basically the same thing for SDag + GIsel / ptx + gcn, with >>>> > associated tests, is also unappealing. >>>> > >>>> > The set of functions near libm is small and known. We would need to >>>> mark >>>> > 'sin' as 'implemented by' slightly different functions for nvptx and >>>> > amdgcn, and some of them need thin wrapper code (e.g. modf in amdgcn >>>> takes >>>> > an argument by pointer). It would be helpful for the fortran runtime >>>> > libraries effort if the implementation didn't use inline code in >>>> headers. >>>> > >>>> > There's very close to a 1:1 mapping between the two gpu libraries, >>>> even >>>> > some extensions to libm exist in both. Therefore we could write a >>>> table, >>>> > {llvm.sin.f64, "sin", __nv_sin, __ocml_sin}, >>>> > with NULL or similar for functions that aren't available. >>>> > >>>> > A function level IR pass, called late in the pipeline, crawls the call >>>> > instructions and rewrites based on simple rules and that table. That >>>> is, >>>> > would rewrite a call to llvm.sin.f64 to a call to __ocml_sin. Exactly >>>> the >>>> > same net effect as a header file containing metadata annotations, >>>> except we >>>> > don't need the metadata machinery and we can use a single trivial IR >>>> pass >>>> > for N architectures (by adding a column). Pass can do the odd ugly >>>> thing >>>> > like impedance match function type easily enough. >>>> > >>>> > The other side of the problem - that functions once introduced have >>>> to hang >>>> > around until we are sure they aren't needed - is the same as in your >>>> > proposal. My preference would be to introduce the libdevice functions >>>> > immediately after the lowering pass above, but we can inject it early >>>> and >>>> > tag them to avoid erasure instead. Kind of need that to handle the >>>> > cos->cosf transform anyway. >>>> > >>>> > Quite similar to the 'in theory ... table' suggestion, which I like >>>> because >>>> > I remember it being far simpler than the sdag rewrite rules. >>>> > >>>> > Thanks! >>>> > >>>> > Jon >>>> > >>>> > >>>> > _______________________________________________ >>>> > LLVM Developers mailing list >>>> > llvm-dev at lists.llvm.org >>>> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>> >>> >>> >>> -- >>> --Artem Belevich >>> >> >> >> -- >> --Artem Belevich >> >-- --Artem Belevich -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20211117/73a7dc92/attachment.html>
Jon Chesterfield via llvm-dev
2021-Nov-17 20:17 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
> > SGTM. > Providing a fixed set of replacements for specific intrinsics is all > NVPTX needs now. > Expanding intrinsics late may miss some optimization opportunities, > so we may consider doing it earlier and/or more than once, in case we > happen to materialize new intrinsics in the later passes. >Good old phase ordering. I don't think we've got any optimisations that target the nv/oc named functions and would personally prefer to never implement any. We do have ones that target llvm.libm, and some that target extern C functions with the same names as libm. There's some code in clang that converts some libm functions into llvm intrinsics, and I think some other code in clang that converts in the other direction. Maybe dependent on various math flags. So it seems we either canonicalise libm-like code and rearrange optimisations to work on the canonical form, or we write optimisations that know there are N names for essentially the same function. I'd prefer to go with the canonical form approach, e.g. we could rewrite calls to __nv_sin into calls to sin early on in the pipeline (or ignore them? seems likely applications call libm functions directly), and rewrite calls to sin to __nv_sin late on, with optimisations written against sin. Thanks! -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20211117/7680f18d/attachment.html>