Artem Belevich via llvm-dev
2021-Sep-07 16:36 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
On Tue, Sep 7, 2021 at 9:15 AM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:

> +bump
>
> Jon did respond positively to the proposal. I think the table
> implementation vs. the "implemented_by" implementation is something we
> can experiment with. I'm in favor of the latter as it is more general
> and can be used in other places more easily, e.g., by providing source
> annotations. That said, having the table version first would be a big
> step forward too.
>
> I'd say, if we hear some other positive voices towards this we go ahead
> with patches on Phab. After an end-to-end series is approved we merge it
> together.
>
> That said, people should chime in if they (dis)like the approach to get
> math optimizations (and similar things) working on the GPU.

I do like this approach for CUDA and NVPTX. I think HIP/AMDGPU may benefit
from it, too (+cc: yaxun.liu@).

This will likely also be useful for things other than math functions. E.g.
it may come in handy for sanitizer runtimes (+cc: eugenis@) that currently
rely on LLVM *not* materializing libcalls they can't provide when they are
building the runtime itself.

--Artem

> ~ Johannes
>
> On 4/29/21 6:25 PM, Jon Chesterfield via llvm-dev wrote:
> >> Date: Wed, 28 Apr 2021 18:56:32 -0400
> >> From: William Moses via llvm-dev <llvm-dev at lists.llvm.org>
> >> To: Artem Belevich <tra at google.com>
> >> ...
> >>
> >> Hi all,
> >>
> >> Reviving this thread as Johannes and I recently had some time to take
> >> a look and do some additional design work. We'd love any thoughts on
> >> the following proposal.
> >>
> > Keenly interested in this. Simplification (subjective) of the metadata
> > proposal at the end. Some extra background info first, though, as GPU
> > libm is a really interesting design space. When I did the bring-up for
> > a different architecture ~3 years ago, IIRC I found the complete set:
> > - clang lowering libm (named functions) to intrinsics
> > - clang lowering intrinsics to libm functions
> > - optimisation passes that transform libm and ignore intrinsics
> > - optimisation passes that transform intrinsics and ignore libm
> > - SelectionDAG represents some intrinsics as nodes
> > - strength reduction, e.g. cos(double) -> cosf(float) under fast-math
> >
> > I then wrote some more IR passes related to OpenCL-style vectorisation
> > and some combines to fill in the gaps (which have not reached
> > upstream). So my knowledge here is out of date, but clang/llvm wasn't
> > a totally consistent lowering framework back then.
> >
> > CUDA ships an IR library containing functions similar to libm. ROCm
> > does something similar, also IR. We do an impedance-matching scheme in
> > inline headers, which blocks various optimisations and poses some
> > challenges for Fortran.
> >
> > *Background:*
> >
> >> ...
> >> While in theory we could define the lowering of these intrinsics to
> >> be a table which looks up the correct __nv_sqrt, this would require
> >> the definition of all such functions to remain or otherwise be
> >> available. As it's undesirable for the LLVM backend to be aware of
> >> CUDA paths, etc., this means that the original definitions brought in
> >> by merging libdevice.bc must be maintained. Currently these are
> >> deleted if they are unused (as libdevice has them marked as internal).
> >>
> > The deleting is its own hazard in the context of fast-math, as the
> > function can be deleted, and then later an optimisation creates a
> > reference to it, which doesn't link. It also prevents the backend from
> > (safely) assuming the functions are available, which is moderately
> > annoying for lowering some SDag ISD nodes.
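For concreteness, the header-based impedance matching mentioned above looks
roughly like the sketch below. This is a simplified illustration in the
spirit of clang's CUDA wrapper headers (e.g. __clang_cuda_math.h), not their
verbatim contents; it shows why the libm-to-__nv_ mapping happens in source,
before the optimizer runs:

  // Simplified sketch of the header-based mapping (illustrative only, not
  // the actual contents of __clang_cuda_math.h).
  extern "C" __device__ double __nv_sin(double);  // defined in libdevice.bc
  extern "C" __device__ float  __nv_sinf(float);

  __device__ inline double sin(double x)  { return __nv_sin(x); }
  __device__ inline float  sinf(float x)  { return __nv_sinf(x); }

  // Because the mapping is done in source, a call to llvm.sin.f64 that an
  // optimisation pass introduces later has no __nv_sin definition left to
  // bind to once the unused, internal libdevice functions are discarded.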
> >
> >> 2) GPU math functions aren't able to be optimized, unlike standard
> >> math functions.
> >>
> > This one is bad.
> >
> > *Design Constraints:*
> >
> >> To remedy the problems described above we need a design that meets
> >> the following:
> >> * Does not require modifying libdevice.bc or other code shipped by a
> >> vendor-specific installation
> >> * Allows llvm math intrinsics to be lowered to device-specific code
> >> * Keeps definitions of code used to implement intrinsics until after
> >> all potentially relevant intrinsics (including those created by LLVM
> >> passes) have been lowered.
> >>
> > Yep, constraints sound right. Back ends can emit calls to these
> > functions too, but I think nvptx/amdgcn do not. Perhaps they would
> > like to be able to in places.
> >
> > *Initial Design:*
> >
> >> ... metadata / aliases ...
> >>
> > The design would work, and lets us continue with the header files we
> > have now. It avoids some tedious programming, i.e. if we approached
> > this as the usual back-end lowering, where intrinsics / ISD nodes are
> > emitted as named function calls. That can be mostly driven by a table
> > lookup as the function arity is limited. It is (i.e. was) quite
> > tedious programming that in ISel. Doing basically the same thing for
> > SDag + GIsel / ptx + gcn, with associated tests, is also unappealing.
> >
> > The set of functions near libm is small and known. We would need to
> > mark 'sin' as 'implemented by' slightly different functions for nvptx
> > and amdgcn, and some of them need thin wrapper code (e.g. modf in
> > amdgcn takes an argument by pointer). It would be helpful for the
> > Fortran runtime libraries effort if the implementation didn't use
> > inline code in headers.
> >
> > There's very close to a 1:1 mapping between the two GPU libraries;
> > even some extensions to libm exist in both. Therefore we could write a
> > table,
> >   {llvm.sin.f64, "sin", __nv_sin, __ocml_sin},
> > with NULL or similar for functions that aren't available.
> >
> > A function-level IR pass, called late in the pipeline, crawls the call
> > instructions and rewrites based on simple rules and that table. That
> > is, it would rewrite a call to llvm.sin.f64 to a call to __ocml_sin.
> > Exactly the same net effect as a header file containing metadata
> > annotations, except we don't need the metadata machinery and we can
> > use a single trivial IR pass for N architectures (by adding a column).
> > The pass can do the odd ugly thing like impedance-match a function
> > type easily enough.
> >
> > The other side of the problem - that functions once introduced have to
> > hang around until we are sure they aren't needed - is the same as in
> > your proposal. My preference would be to introduce the libdevice
> > functions immediately after the lowering pass above, but we can inject
> > them early and tag them to avoid erasure instead. We kind of need that
> > to handle the cos->cosf transform anyway.
> >
> > Quite similar to the 'in theory ... table' suggestion, which I like
> > because I remember it being far simpler than the SDag rewrite rules.
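For illustration, a minimal sketch of such a table-driven rewriting pass
could look like the following. The pass name, the table contents, and the
column selection are hypothetical (taken from the table row quoted above,
not from actual upstream code), and only the f64 overloads are handled:

  #include "llvm/IR/InstIterator.h"
  #include "llvm/IR/Intrinsics.h"
  #include "llvm/IR/IntrinsicInst.h"
  #include "llvm/IR/Module.h"
  #include "llvm/IR/PassManager.h"

  using namespace llvm;

  namespace {

  // One row per libm-like function, one column per vendor library. Names
  // follow the mail above and may not match the real entry points exactly.
  struct MathMapping {
    Intrinsic::ID ID;     // e.g. Intrinsic::sin for llvm.sin.*
    const char *LibmName; // "sin"
    const char *NVName;   // "__nv_sin", nullptr if libdevice lacks it
    const char *OCMLName; // "__ocml_sin", nullptr if OCML lacks it
  };

  static const MathMapping Table[] = {
      {Intrinsic::sin, "sin", "__nv_sin", "__ocml_sin"},
      {Intrinsic::cos, "cos", "__nv_cos", "__ocml_cos"},
      // ...
  };

  struct GPUMathLoweringPass : PassInfoMixin<GPUMathLoweringPass> {
    // In a real pass this would be derived from the target triple
    // (nvptx* -> libdevice column, amdgcn -> OCML column).
    bool UseOCML = false;

    PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
      Module &M = *F.getParent();
      bool Changed = false;
      for (Instruction &I : instructions(F)) {
        auto *II = dyn_cast<IntrinsicInst>(&I);
        // Only the f64 case is handled here; a real pass would also map
        // the f32 (and possibly vector) overloads to the *f variants.
        if (!II || !II->getType()->isDoubleTy())
          continue;
        for (const MathMapping &Row : Table) {
          if (II->getIntrinsicID() != Row.ID)
            continue;
          const char *Target = UseOCML ? Row.OCMLName : Row.NVName;
          if (!Target)
            break; // no device implementation; leave the intrinsic alone
          // The prototypes line up for plain libm-style functions, so the
          // call can simply be retargeted; functions that need wrappers
          // (e.g. modf on amdgcn) would need extra handling here.
          FunctionCallee Callee =
              M.getOrInsertFunction(Target, II->getFunctionType());
          II->setCalledFunction(Callee);
          Changed = true;
          break;
        }
      }
      return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
    }
  };

  } // end anonymous namespace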
> > Thanks!
> >
> > Jon

--
--Artem Belevich
Artem Belevich via llvm-dev
2021-Nov-17 19:20 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
bump.

On Tue, Sep 7, 2021 at 9:36 AM Artem Belevich <tra at google.com> wrote:

> On Tue, Sep 7, 2021 at 9:15 AM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
>
>> ...
>>
>> I'd say, if we hear some other positive voices towards this we go ahead
>> with patches on Phab. After an end-to-end series is approved we merge
>> it together.

I think we've got as much interest expressed (or not) as we can reasonably
expect for something that most back-ends do not care about.

I vote for moving forward with the patches.

--Artem

--
--Artem Belevich