Jon Chesterfield via llvm-dev
2021-Nov-17 19:40 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
Thanks for the ping.

The IR pass that rewrote llvm.libm intrinsics to architecture-specific ones, which I wrote years ago, was pretty trivial. I'm up for re-implementing it.

Essentially: type out a (hash) table with entries like {llvm.sin.f64, "sin", __nv_sin, __ocml_sin} and do the substitution in a pass called 'ExpandLibmIntrinsics' or similar, run somewhere before instruction selection for nvptx / amdgpu / other.

It could be factored differently if we don't like having the nv/oc names next to each other; the pass could take the corresponding lookup table as an argument.

The main benefit over the implemented-in-terms-of metadata approach is that it's trivial to implement and dead simple. Lowering in IR means doing it once instead of once in SelectionDAG and once in GlobalISel. I'll write the pass (from scratch, annoyingly, as the last version I wrote is still closed source) if people seem in favour.

Thanks all,

Jon
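For concreteness, a minimal sketch of what such a table-driven rewrite could look like. It is illustrative only, not the proposed patch: the __nv_*/__ocml_* names are copied from the example entry in the thread and are not checked against the vendor bitcode libraries, and only the f64 variants are handled.

#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"

using namespace llvm;

struct LibmRow {
  Intrinsic::ID IID;     // intrinsic family, e.g. llvm.sin.*
  const char *NVName;    // CUDA libdevice symbol (assumed, per the thread)
  const char *OCMLName;  // ROCm device-libs symbol, or nullptr if absent
};

// One row per libm-ish intrinsic; supporting another target means adding a column.
static const LibmRow Rows[] = {
    {Intrinsic::sin,  "__nv_sin",  "__ocml_sin"},
    {Intrinsic::cos,  "__nv_cos",  "__ocml_cos"},
    {Intrinsic::sqrt, "__nv_sqrt", "__ocml_sqrt"},
};

// Rewrite intrinsic calls in F; IsNVPTX picks the column. Only f64 variants
// are handled here; a real pass would also map the f32 variants
// (e.g. __nv_sinf) and do any impedance matching of argument types.
bool expandLibmIntrinsics(Function &F, bool IsNVPTX) {
  Module &M = *F.getParent();
  bool Changed = false;
  for (Instruction &I : instructions(F)) {
    auto *II = dyn_cast<IntrinsicInst>(&I);
    if (!II || !II->getType()->isDoubleTy())
      continue;
    for (const LibmRow &R : Rows) {
      if (II->getIntrinsicID() != R.IID)
        continue;
      if (const char *Name = IsNVPTX ? R.NVName : R.OCMLName) {
        // Reuse the intrinsic's own signature for the library callee and
        // swap the callee in place; the operands are left untouched.
        FunctionCallee Callee = M.getOrInsertFunction(
            Name, II->getCalledFunction()->getFunctionType());
        II->setCalledFunction(Callee);
        Changed = true;
      }
      break;
    }
  }
  return Changed;
}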
On Wed, Nov 17, 2021 at 7:20 PM Artem Belevich <tra at google.com> wrote:
> bump.
>
> On Tue, Sep 7, 2021 at 9:36 AM Artem Belevich <tra at google.com> wrote:
>> On Tue, Sep 7, 2021 at 9:15 AM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
>>> +bump
>>>
>>> Jon did respond positively to the proposal. I think the table implementation vs the "implemented_by" implementation is something we can experiment with. I'm in favor of the latter as it is more general and can be used in other places more easily, e.g., by providing source annotations. That said, having the table version first would be a big step forward too.
>>>
>>> I'd say, if we hear some other positive voices towards this, we go ahead with patches on Phabricator. After an end-to-end series is approved, we merge it together.
>
> I think we've got as much interest expressed (or not) as we can reasonably expect for something that most back-ends do not care about. I vote for moving forward with the patches.
>
> --Artem
>
>>> That said, people should chime in if they (dis)like the approach to get math optimizations (and similar things) working on the GPU.
>>
>> I do like this approach for CUDA and NVPTX. I think HIP/AMDGPU may benefit from it, too (+cc: yaxun.liu@).
>>
>> This will likely also be useful for things other than math functions. E.g. it may come in handy for sanitizer runtimes (+cc: eugenis@) that currently rely on LLVM *not* materializing libcalls they can't provide when they are building the runtime itself.
>>
>> --Artem
>>
>>> ~ Johannes
>>>
>>> On 4/29/21 6:25 PM, Jon Chesterfield via llvm-dev wrote:
>>>>> Date: Wed, 28 Apr 2021 18:56:32 -0400
>>>>> From: William Moses via llvm-dev <llvm-dev at lists.llvm.org>
>>>>> To: Artem Belevich <tra at google.com>
>>>>> ...
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Reviving this thread as Johannes and I recently had some time to take a look and do some additional design work. We'd love any thoughts on the following proposal.
>>>>
>>>> Keenly interested in this. A (subjective) simplification of the metadata proposal is at the end; some extra background information first, though, as GPU libm is a really interesting design space. When I did the bring-up for a different architecture ~3 years ago, IIRC I found the complete set:
>>>> - clang lowering libm (named functions) to intrinsics
>>>> - clang lowering intrinsics to libm functions
>>>> - optimisation passes that transform libm and ignore intrinsics
>>>> - optimisation passes that transform intrinsics and ignore libm
>>>> - SelectionDAG representing some intrinsics as nodes
>>>> - strength reduction, e.g. cos(double) -> cosf(float) under fast-math
>>>>
>>>> I then wrote some more IR passes related to OpenCL-style vectorisation and some combines to fill in the gaps (which have not reached upstream). So my knowledge here is out of date, but clang/llvm wasn't a totally consistent lowering framework back then.
>>>>
>>>> CUDA ships an IR library containing functions similar to libm. ROCm does something similar, also as IR. We do an impedance-matching scheme in inline headers which blocks various optimisations and poses some challenges for Fortran.
>>>>
>>>>> *Background:*
>>>>> ...
>>>>> While in theory we could define the lowering of these intrinsics to be a table which looks up the correct __nv_sqrt, this would require the definition of all such functions to remain or otherwise be available. As it's undesirable for the LLVM backend to be aware of CUDA paths, etc., this means that the original definitions brought in by merging libdevice.bc must be maintained. Currently these are deleted if they are unused (as libdevice has them marked as internal).
>>>>
>>>> The deleting is its own hazard in the context of fast-math: the function can be deleted, and then a later optimisation creates a reference to it, which doesn't link. It also prevents the backend from (safely) assuming the functions are available, which is moderately annoying for lowering some SDag ISD nodes.
>>>>
>>>>> 2) GPU math functions aren't able to be optimized, unlike standard math functions.
>>>>
>>>> This one is bad.
>>>>
>>>>> *Design Constraints:*
>>>>> To remedy the problems described above we need a design that meets the following:
>>>>> * Does not require modifying libdevice.bc or other code shipped by a vendor-specific installation
>>>>> * Allows llvm math intrinsics to be lowered to device-specific code
>>>>> * Keeps definitions of code used to implement intrinsics until after all potentially relevant intrinsics (including those created by LLVM passes) have been lowered.
>>>>
>>>> Yep, the constraints sound right. Back ends can emit calls to these functions too, but I think nvptx/amdgcn do not. Perhaps they would like to be able to in places.
>>>>
>>>>> *Initial Design:*
>>>>> ... metadata / aliases ...
>>>>
>>>> The design would work and lets us continue with the header files we have now. It avoids some tedious programming, i.e. approaching this as the usual back-end lowering, where intrinsics / ISD nodes are emitted as named function calls. That can mostly be driven by a table lookup, as the function arity is limited, but it is (i.e. was) quite tedious to program in ISel, and doing basically the same thing for SDag + GlobalISel / ptx + gcn, with associated tests, is also unappealing.
>>>>
>>>> The set of functions near libm is small and known. We would need to mark 'sin' as 'implemented by' slightly different functions for nvptx and amdgcn, and some of them need thin wrapper code (e.g. modf in amdgcn takes an argument by pointer). It would be helpful for the Fortran runtime libraries effort if the implementation didn't use inline code in headers.
>>>>
>>>> There's very close to a 1:1 mapping between the two GPU libraries; even some extensions to libm exist in both. Therefore we could write a table,
>>>> {llvm.sin.f64, "sin", __nv_sin, __ocml_sin},
>>>> with NULL or similar for functions that aren't available.
>>>>
>>>> A function-level IR pass, called late in the pipeline, crawls the call instructions and rewrites them based on simple rules and that table. That is, it would rewrite a call to llvm.sin.f64 into a call to __ocml_sin. Exactly the same net effect as a header file containing metadata annotations, except that we don't need the metadata machinery and we can use a single trivial IR pass for N architectures (by adding a column). The pass can do the odd ugly thing, like impedance-matching a function type, easily enough.
>>>>
>>>> The other side of the problem - that functions, once introduced, have to hang around until we are sure they aren't needed - is the same as in your proposal. My preference would be to introduce the libdevice functions immediately after the lowering pass above, but we can inject them early and tag them to avoid erasure instead. We kind of need that to handle the cos->cosf transform anyway.
>>>>
>>>> Quite similar to the 'in theory ... table' suggestion, which I like because I remember it being far simpler than the SDag rewrite rules.
>>>>
>>>> Thanks!
>>>>
>>>> Jon
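On the "inject them early and tag them to avoid erasure" point above: one way the tagging could be done, sketched under the assumption that libdevice.bc has already been linked in and that its functions keep the __nv_ prefix, is to add them to llvm.compiler.used so internalization and GlobalDCE leave them alone until the expansion pass has run.

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/ModuleUtils.h"

using namespace llvm;

static void keepDeviceLibFunctionsAlive(Module &M) {
  SmallVector<GlobalValue *, 32> Keep;
  for (Function &F : M)
    if (!F.isDeclaration() && F.getName().startswith("__nv_"))
      Keep.push_back(&F);
  // Once every libm-style intrinsic (including ones materialized by later
  // optimisations such as cos -> cosf) has been rewritten, these entries can
  // be dropped again and unused bodies discarded as usual.
  appendToCompilerUsed(M, Keep);
}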
Artem Belevich via llvm-dev
2021-Nov-17 20:05 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
On Wed, Nov 17, 2021 at 11:40 AM Jon Chesterfield <jonathanchesterfield at gmail.com> wrote:
> Thanks for the ping.
>
> The IR pass that rewrote llvm.libm intrinsics to architecture-specific ones, which I wrote years ago, was pretty trivial. I'm up for re-implementing it.
>
> Essentially: type out a (hash) table with entries like {llvm.sin.f64, "sin", __nv_sin, __ocml_sin} and do the substitution in a pass called 'ExpandLibmIntrinsics' or similar, run somewhere before instruction selection for nvptx / amdgpu / other.
>
> It could be factored differently if we don't like having the nv/oc names next to each other; the pass could take the corresponding lookup table as an argument.
>
> The main benefit over the implemented-in-terms-of metadata approach is that it's trivial to implement and dead simple. Lowering in IR means doing it once instead of once in SelectionDAG and once in GlobalISel. I'll write the pass (from scratch, annoyingly, as the last version I wrote is still closed source) if people seem in favour.
>
> Thanks all,
>
> Jon

SGTM. Providing a fixed set of replacements for specific intrinsics is all NVPTX needs now.

Expanding intrinsics late may miss some optimization opportunities, so we may consider doing it earlier and/or more than once, in case we happen to materialize new intrinsics in later passes.

--Artem
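Regarding running the expansion earlier and/or more than once: a rough sketch of how the rewrite could be wrapped as a new-pass-manager function pass, so the same code can be scheduled both in the middle-end pipeline and again just before instruction selection. The expandLibmIntrinsics helper is the hypothetical function from the sketch further up, not an existing LLVM API.

#include "llvm/IR/Function.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

// Assumed to be the table-driven rewrite from the earlier sketch.
bool expandLibmIntrinsics(Function &F, bool IsNVPTX);

struct ExpandLibmIntrinsicsPass : PassInfoMixin<ExpandLibmIntrinsicsPass> {
  bool IsNVPTX;
  explicit ExpandLibmIntrinsicsPass(bool IsNVPTX) : IsNVPTX(IsNVPTX) {}

  PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
    // Report conservatively: if anything was rewritten, invalidate analyses.
    return expandLibmIntrinsics(F, IsNVPTX) ? PreservedAnalyses::none()
                                            : PreservedAnalyses::all();
  }
};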