thr3ads.net - llvm dev - [llvm-dev] NVPTX codegen for llvm.sin (and friends) [Nov 2021]

Artem Belevich via llvm-dev

2021-Nov-17 20:05 UTC

[llvm-dev] NVPTX codegen for llvm.sin (and friends)

On Wed, Nov 17, 2021 at 11:40 AM Jon Chesterfield <
jonathanchesterfield at gmail.com> wrote:
> Thanks for the ping.
>
> The IR pass that rewrote llvm.libm intrinsics to architecture specific
> ones I wrote years ago was pretty trivial. I'm up for re-implementing
> that.
>
> Essentially type out a (hash)table with entries like {llvm.sin.f64,
> "sin", __nv_sin, __ocml_sin} and do the substitution as a pass called
> 'ExpandLibmIntrinsics' or similar, run somewhere before instruction
> selection for nvptx / amdgpu / other.
>
> Could factor it differently if we don't like having the nv/oc names
> next to each other, pass could take the corresponding lookup table as
> an argument.
>
> Main benefit over the implemented-in-terms-of metadata approach is it's
> trivial to implement and dead simple. Lowering in IR means doing it once
> instead of once in sdag and once in gisel. I'll write the pass (from
> scratch, annoyingly, as the last version I wrote is still closed source) if
> people seem in favour.
>
SGTM.
Providing a fixed set of replacements for specific intrinsics is all
NVPTX needs now.
Expanding intrinsics late may miss some optimization opportunities,
so we may consider doing it earlier and/or more than once, in case we
happen to materialize new intrinsics in the later passes.
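
A minimal sketch of the table-driven substitution, modelling a function as a
plain list of callee names rather than real LLVM IR. The only row taken from
this thread is {llvm.sin.f64, "sin", __nv_sin, __ocml_sin}; the second row,
the LibmEntry type and the expand_libm_intrinsics helper are hypothetical
names made up for illustration, not an existing LLVM API.

```python
from typing import NamedTuple, Optional


class LibmEntry(NamedTuple):
    intrinsic: str         # LLVM intrinsic name
    libm: str              # plain libm name
    nvptx: Optional[str]   # replacement on nvptx, None if unavailable
    amdgcn: Optional[str]  # replacement on amdgcn, None if unavailable


TABLE = [
    # Row quoted from the thread.
    LibmEntry("llvm.sin.f64", "sin", "__nv_sin", "__ocml_sin"),
    # Hypothetical row showing a missing replacement for one target.
    LibmEntry("llvm.hypothetical.f64", "hypothetical", "__nv_hypothetical", None),
]


def expand_libm_intrinsics(callees, target, table=TABLE):
    """Rewrite intrinsic callee names to the names used by `target`.

    `target` selects a column of the table ("nvptx" or "amdgcn").
    Names without an entry, or with None in that column, are left alone.
    """
    by_intrinsic = {entry.intrinsic: entry for entry in table}
    rewritten = []
    for name in callees:
        entry = by_intrinsic.get(name)
        replacement = getattr(entry, target) if entry is not None else None
        rewritten.append(name if replacement is None else replacement)
    return rewritten
```

In this sketch the rewrite is idempotent, so running it again after later
passes have materialized new intrinsics only touches the new calls.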

--Artem

>
> Thanks all,
>
> Jon
>
> On Wed, Nov 17, 2021 at 7:20 PM Artem Belevich <tra at google.com> wrote:
>
>> bump.
>>
>> On Tue, Sep 7, 2021 at 9:36 AM Artem Belevich <tra at google.com> wrote:
>>
>>>
>>> On Tue, Sep 7, 2021 at 9:15 AM Johannes Doerfert <
>>> johannesdoerfert at gmail.com> wrote:
>>>
>>>> +bump
>>>>
>>>> Jon did respond positive to the proposal. I think the table
>>>> implementation vs the "implemented_by" implementation is something
>>>> we can experiment with.
>>>> I'm in favor of the latter as it is more general and can be used in
>>>> other places more easily, e.g., by providing source annotations.
>>>> That said, having the table version first would be a big step
>>>> forward too.
>>>>
>>>> I'd say, if we hear some other positive voices towards this we go
>>>> ahead with patches on phab. After an end-to-end series is approved
>>>> we merge it together.
>>>>
>>>
>> I think we've got as much interest expressed (or not) as we can
>> reasonably expect for something that most back-ends do not care about.
>> I vote for moving forward with the patches.
>>
>> --Artem
>>
>>
>>
>>>
>>>> That said, people should chime in if they (dis)like the approach to
>>>> get math optimizations (and similar things) working on the GPU.
>>>>
>>>
>>> I do like this approach for CUDA and NVPTX. I think HIP/AMDGPU may
>>> benefit from it, too (+cc: yaxun.liu@).
>>>
>>> This will likely also be useful for things other than math functions.
>>> E.g. it may come handy for sanitizer runtimes (+cc: eugenis@) that
>>> currently rely on LLVM *not* materializing libcalls they can't provide
>>> when they are building the runtime itself.
>>>
>>> --Artem
>>>
>>>
>>>>
>>>> ~ Johannes
>>>>
>>>>
>>>> On 4/29/21 6:25 PM, Jon Chesterfield via llvm-dev wrote:
>>>> >> Date: Wed, 28 Apr 2021 18:56:32 -0400
>>>> >> From: William Moses via llvm-dev <llvm-dev at lists.llvm.org>
>>>> >> To: Artem Belevich <tra at google.com>
>>>> >> ...
>>>> >>
>>>> >> Hi all,
>>>> >>
>>>> >> Reviving this thread as Johannes and I recently had some time to
>>>> >> take a look and do some additional design work. We'd love any
>>>> >> thoughts on the following proposal.
>>>> >>
>>>> > Keenly interested in this. Simplification (subjective) of the
>>>> > metadata proposal at the end. Some extra background info first
>>>> > though as GPU libm is a really interesting design space. When I did
>>>> > the bring up for a different architecture ~3 years ago, iirc I
>>>> > found the complete set:
>>>> > - clang lowering libm (named functions) to intrinsics
>>>> > - clang lowering intrinsic to libm functions
>>>> > - optimisation passes that transform libm and ignore intrinsics
>>>> > - optimisation passes that transform intrinsics and ignore libm
>>>> > - selectiondag represents some intrinsics as nodes
>>>> > - strength reduction, e.g. cos(double) -> cosf(float) under
>>>> >   fast-math
>>>> >
>>>> > I then wrote some more IR passes related to opencl-style
>>>> > vectorisation and some combines to fill in the gaps (which have not
>>>> > reached upstream). So my knowledge here is out of date but
>>>> > clang/llvm wasn't a totally consistent lowering framework back then.
>>>> >
>>>> > Cuda ships an IR library containing functions similar to libm. ROCm
>>>> > does something similar, also IR. We do an impedance matching scheme
>>>> > in inline headers which blocks various optimisations and poses some
>>>> > challenges for fortran.
>>>> >
>>>> >   *Background:*
>>>> >
>>>> >> ...
>>>> >> While in theory we could define the lowering of these intrinsics
>>>> >> to be a table which looks up the correct __nv_sqrt, this would
>>>> >> require the definition of all such functions to remain or
>>>> >> otherwise be available. As it's undesirable for the LLVM backend
>>>> >> to be aware of CUDA paths, etc, this means that the original
>>>> >> definitions brought in by merging libdevice.bc must be maintained.
>>>> >> Currently these are deleted if they are unused (as libdevice has
>>>> >> them marked as internal).
>>>> >>
>>>> > The deleting is its own hazard in the context of fast-math, as the
>>>> > function can be deleted, and then later an optimisation creates a
>>>> > reference to it, which doesn't link. It also prevents the backend
>>>> > from (safely) assuming the functions are available, which is
>>>> > moderately annoying for lowering some SDag ISD nodes.
>>>> >
>>>> >> 2) GPU math functions aren't able to be optimized, unlike standard
>>>> >> math functions.
>>>> >>
>>>> > This one is bad.
>>>> >
>>>> > *Design Constraints:*
>>>> >> To remedy the problems described above we need a design that meets
>>>> >> the following:
>>>> >> * Does not require modifying libdevice.bc or other code shipped by
>>>> >>   a vendor-specific installation
>>>> >> * Allows llvm math intrinsics to be lowered to device-specific code
>>>> >> * Keeps definitions of code used to implement intrinsics until
>>>> >>   after all potential relevant intrinsics (including those created
>>>> >>   by LLVM passes) have been lowered.
>>>> >>
>>>> > Yep, constraints sound right. Back ends can emit calls to these
>>>> > functions too, but I think nvptx/amdgcn do not. Perhaps they would
>>>> > like to be able to in places.
>>>> >
>>>> >   *Initial Design:*
>>>> >
>>>> >> ... metadata / aliases ...
>>>> >>
>>>> > Design would work, lets us continue with the header files we have
>>>> > now. Avoids some tedious programming, i.e. if we approached this as
>>>> > the usual back end lowering, where intrinsics / isd nodes are
>>>> > emitted as named function calls. That can be mostly driven by a
>>>> > table lookup as the function arity is limited. It is (i.e. was)
>>>> > quite tedious programming that in ISel. Doing basically the same
>>>> > thing for SDag + GIsel / ptx + gcn, with associated tests, is also
>>>> > unappealing.
>>>> >
>>>> > The set of functions near libm is small and known. We would need to
>>>> > mark 'sin' as 'implemented by' slightly different functions for
>>>> > nvptx and amdgcn, and some of them need thin wrapper code (e.g.
>>>> > modf in amdgcn takes an argument by pointer). It would be helpful
>>>> > for the fortran runtime libraries effort if the implementation
>>>> > didn't use inline code in headers.
>>>> >
>>>> > There's very close to a 1:1 mapping between the two gpu libraries,
>>>> > even some extensions to libm exist in both. Therefore we could
>>>> > write a table,
>>>> > {llvm.sin.f64, "sin", __nv_sin, __ocml_sin},
>>>> > with NULL or similar for functions that aren't available.
>>>> >
>>>> > A function level IR pass, called late in the pipeline, crawls the
>>>> > call instructions and rewrites based on simple rules and that
>>>> > table. That is, would rewrite a call to llvm.sin.f64 to a call to
>>>> > __ocml_sin. Exactly the same net effect as a header file containing
>>>> > metadata annotations, except we don't need the metadata machinery
>>>> > and we can use a single trivial IR pass for N architectures (by
>>>> > adding a column). Pass can do the odd ugly thing like impedance
>>>> > match function type easily enough.
>>>> >
>>>> > The other side of the problem - that functions once introduced have
>>>> > to hang around until we are sure they aren't needed - is the same
>>>> > as in your proposal. My preference would be to introduce the
>>>> > libdevice functions immediately after the lowering pass above, but
>>>> > we can inject it early and tag them to avoid erasure instead. Kind
>>>> > of need that to handle the cos->cosf transform anyway.
>>>> >
>>>> > Quite similar to the 'in theory ... table' suggestion, which I like
>>>> > because I remember it being far simpler than the sdag rewrite
>>>> > rules.
>>>> >
>>>> > Thanks!
>>>> >
>>>> > Jon
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > LLVM Developers mailing list
>>>> > llvm-dev at lists.llvm.org
>>>> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>
>>>
>>> --
>>> --Artem Belevich
>>>
>>
>>
>> --
>> --Artem Belevich
>>
>
-- 
--Artem Belevich
Jon Chesterfield via llvm-dev

2021-Nov-17 20:17 UTC

[llvm-dev] NVPTX codegen for llvm.sin (and friends)

>
> SGTM.
> Providing a fixed set of replacements for specific intrinsics is all
> NVPTX needs now.
> Expanding intrinsics late may miss some optimization opportunities,
> so we may consider doing it earlier and/or more than once, in case we
> happen to materialize new intrinsics in the later passes.
>
Good old phase ordering. I don't think we've got any optimisations that
target the nv/oc named functions and would personally prefer to never
implement any.

We do have ones that target llvm.libm, and some that target extern C
functions with the same names as libm. There's some code in clang that
converts some libm functions into llvm intrinsics, and I think some other
code in clang that converts in the other direction. Maybe dependent on
various math flags.

So it seems we either canonicalise libm-like code and rearrange
optimisations to work on the canonical form, or we write optimisations that
know there are N names for essentially the same function. I'd prefer to go
with the canonical form approach, e.g. we could rewrite calls to __nv_sin
into calls to sin early on in the pipeline (or ignore them? seems likely
applications call libm functions directly), and rewrite calls to sin to
__nv_sin late on, with optimisations written against sin.
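
A rough sketch of that canonical-form ordering, again treating a function as
a list of callee names instead of real IR. The sin / __nv_sin / __ocml_sin
names come from this thread; VENDOR_NAMES, canonicalise and lower are
hypothetical helper names, and the step in the middle only stands in for
whatever optimisations get written against sin.

```python
# Canonical libm name -> vendor name per target (names from this thread).
VENDOR_NAMES = {
    "sin": {"nvptx": "__nv_sin", "amdgcn": "__ocml_sin"},
}


def canonicalise(callees):
    """Early in the pipeline: map vendor names back to the canonical name."""
    to_canonical = {
        vendor: canonical
        for canonical, per_target in VENDOR_NAMES.items()
        for vendor in per_target.values()
    }
    return [to_canonical.get(name, name) for name in callees]


def lower(callees, target):
    """Late in the pipeline: map canonical names to the target's vendor names."""
    return [VENDOR_NAMES.get(name, {}).get(target, name) for name in callees]


def pipeline(callees, target):
    canonical = canonicalise(callees)
    # Optimisations written against the canonical names would run here.
    return lower(canonical, target)
```

With this ordering a direct call to __nv_sin and a call to sin end up in the
same form before optimisation, so a pass only has to know one name.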

Thanks!