thr3ads.net - llvm dev - [llvm-dev] NVPTX codegen for llvm.sin (and friends) [Mar 2021]

Artem Belevich via llvm-dev

2021-Mar-11 19:37 UTC

[llvm-dev] NVPTX codegen for llvm.sin (and friends)

On Thu, Mar 11, 2021 at 10:54 AM Johannes Doerfert <
johannesdoerfert at gmail.com> wrote:
> I certainly agree we should try to avoid a hard-coded mapping
> in C++.
>
> I could see something like:
>
> ```
> __attribute__((implementation("llvm.cos")))
> double __nv_cos(...) { ... }
>
> ```
>
> and a pass that transforms all calls to a function with an
> "implementation" to calls to that implementation. Maybe
> later we attach a score/priority ;)
>
I'm not sure how that would work.
Where would you place that `__attribute__((implementation))`? We do not
have the definitions for `__nv_*` as they come from NVIDIA-provided
bitcode. We could add the attribute to the declaration in
`__clang_cuda_libdevice_declares.h`.
How does LLVM handle the differences in function attributes between
function declaration and definition? Will there be trouble when we link in
the actual __nv_cos from libdevice, which would not have that attribute?

Another potential gotcha is that for the functions that can't be directly
mapped 1:1 to `__nv_*` counterparts, we'd still need to provide the
implementation ourselves. We will not know whether the implementation will
be used until after the substitution pass, so we'll need to make sure it's
not DCE'd until then. It appears to be the same issue (though on a smaller
scale) as with linking in libdevice directly.

Let's take a step back and figure out which issues we want to solve.

The top-level goal is to provide implementations for LLVM intrinsics. For
now let's stick with the libm-related ones.
What we have is the libdevice bitcode, which uses different function names
and provides a subset of the functionality we need.
What we are missing is
  - something to connect LLVM's libcalls to the GPU-side implementation,
  - additional code to provide implementations for the functions that are
    missing or different in libdevice.

Considering that we want this to work in LLVM, the additional code would
have to be bitcode, and it would have to exist in addition to libdevice.
Our options for the mapping between LLVM intrinsics and the implementation
are
* an intrinsic -> __nv_* equivalent mapping pass.
   This would still need additional bitcode for the missing/different
   functions.
* lower libcalls to the standard libm APIs, and implement the libm -> __nv_*
   mapping in our own bitcode.

Considering that additional bitcode is needed in both cases, I believe that
the second approach makes more sense.
LLVM does not need to know or care about what's provided by libdevice, and
we'd have more flexibility, compared to what we could do in the mapping
pass. It also makes it easy to substitute a different implementation, if we
have or need one.
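
As a rough illustration of the second option, here is a minimal C++ sketch of
what such a shim layer might look like. Everything in it is an assumption made
for illustration: `nv_cos_stub` and `nv_scalbn_stub` are hypothetical stand-ins
for libdevice's `__nv_cos` and `__nv_scalbn` (implemented with the host libm so
the sketch compiles without the CUDA SDK), and `shim_cos` / `shim_scalbln` play
the role of the standard libm names that libcalls would be lowered to. A real
shim would be shipped as bitcode and would only declare the `__nv_*` functions.

```cpp
#include <cassert>
#include <climits>
#include <cmath>

// Hypothetical stand-ins for libdevice's __nv_cos and __nv_scalbn. In a real
// shim these would only be declared and resolved when libdevice is linked in;
// here they call the host libm so the sketch compiles without the CUDA SDK.
inline double nv_cos_stub(double x) { return std::cos(x); }
inline double nv_scalbn_stub(double x, int n) { return std::scalbn(x, n); }

// 1:1 case: the libm-style entry point simply forwards.
inline double shim_cos(double x) { return nv_cos_stub(x); }

// "Different" case: the libm-style signature takes a long exponent while the
// stand-in takes an int, so the shim clamps the exponent before forwarding.
// For IEEE doubles the clamp does not change the result, because an exponent
// of INT_MAX or INT_MIN already saturates to infinity or zero.
inline double shim_scalbln(double x, long n) {
  int clamped =
      n > INT_MAX ? INT_MAX : (n < INT_MIN ? INT_MIN : static_cast<int>(n));
  return nv_scalbn_stub(x, clamped);
}

// Small self-check showing how the shims are meant to be called.
inline void check_shims() {
  assert(shim_cos(0.0) == 1.0);
  assert(shim_scalbln(1.0, 3) == 8.0);
}
```

The point of the sketch is only that the 1:1 cases and the odd cases live in
the same place, outside of LLVM, so swapping in a different implementation
means replacing the stand-ins rather than touching a mapping pass.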

WDYT?

--Artem

>
>
> On 3/10/21 8:44 PM, William Moses wrote:
> > We could also consider doing something slightly broader.
> >
> > For example we could define a special attribute on top of the llvm.cos
> > call/declaration etc with metadata or an attribute that points to the
> > actual __nv_cos function. Then in a subsequent lowering pass the
> > corresponding intrinsic with the relevant attribute has its uses replaced
> > by the actual function.
> >
> >
> > On Wed, Mar 10, 2021 at 7:57 PM Johannes Doerfert <
> > johannesdoerfert at gmail.com> wrote:
> >
> >> On 3/10/21 6:22 PM, Artem Belevich wrote:
> >>> On Wed, Mar 10, 2021 at 3:44 PM Johannes Doerfert <
> >>> johannesdoerfert at gmail.com> wrote:
> >>>
> >>>> On 3/10/21 4:38 PM, Artem Belevich wrote:
> >>>>> On Wed, Mar 10, 2021 at 1:55 PM Johannes Doerfert <
> >>>>> johannesdoerfert at gmail.com> wrote:
> >>>>>
> >>>>>> On 3/10/21 3:25 PM, Artem Belevich wrote:
> >>>>>>> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <
> >>>>>>> johannesdoerfert at gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Right. We could keep the definition of __nv_cos and friends
> >>>>>>>> around. Right now, -ffast-math might just crash on the user,
> >>>>>>>> which is arguably a bad thing. I can also see us benefiting
> >>>>>>>> in various other ways from llvm.cos uses instead of __nv_cos
> >>>>>>>> (assuming precision is according to the user requirements but
> >>>>>>>> that is always a condition).
> >>>>>>>>
> >>>>>>>> It could be as simple as introducing __nv_cos into
> >>>>>>>> "llvm.used" and a backend matching/rewrite pass.
> >>>>>>>>
> >>>>>>>> If the backend knew the libdevice location it could even pick
> >>>>>>>> the definitions from there. Maybe we could link libdevice late
> >>>>>>>> instead of eager?
> >>>>>>>>
> >>>>>>> It's possible, but it would require plumbing in CUDA SDK awareness into
> >>>>>>> LLVM. While clang driver can deal with that, LLVM currently can't. The
> >>>>>>> bitcode library path would have to be provided by the user.
> >>>>>> The PTX backend could arguably be CUDA SDK aware, IMHO, it would
> >>>>>> even be fine if the middle-end does the remapping to get inlining
> >>>>>> and folding benefits also after __nv_cos is used. See below.
> >>>>>>
> >>>>>>
> >>>>>>> The standard library as bitcode raises some questions.
> >>>>>> Which standard library? CUDA's libdevice is a bitcode library, right?
> >>>>>>
> >>>>> It's whatever LLVM will need to lower libcalls to. libdevice bitcode is
> >>>>> the closest approximation of that we have at the moment.
> >>>>>
> >>>>>
> >>>>>>> * When do we want to do the linking? If we do it at the beginning, then
> >>>>>>> the question is how to make sure unused functions are not eliminated
> >>>>>>> before we may need them, as we don't know a priori what's going to be
> >>>>>>> needed. We also do want the unused functions to be gone after we're
> >>>>>>> done. Linking it in early would allow optimizing the code better at the
> >>>>>>> expense of having to optimize a lot of code we'll throw away. Linking it
> >>>>>>> in late has less overhead, but leaves the linked in bitcode unoptimized,
> >>>>>>> though it's probably in the ballpark of what would happen with a real
> >>>>>>> library call. I.e. no inlining, etc.
> >>>>>>>
> >>>>>>> * It incorporates linking into LLVM, which is not LLVM's job. Arguably,
> >>>>>>> the line should be drawn at the lowering to libcalls as it's done for
> >>>>>>> other back-ends. However, we're also constrained by the need to have
> >>>>>>> the linking done before we generate PTX which prevents doing it after
> >>>>>>> LLVM is done generating an object file.
> >>>>>> I'm confused. Clang links in libdevice.bc early.
> >>>>> Yes. Because that's where it has to happen if we want to keep LLVM
> >>>>> unaware of CUDA SDK.
> >>>>> It does not have to be the case if/when LLVM can do the linking itself.
> >>>>>
> >>>>>
> >>>>>> If we make sure
> >>>>>> `__nv_cos` is not deleted early, we can at any point "lower" `llvm.cos`
> >>>>>> to `__nv_cos` which is available. After the lowering we can remove
> >>>>>> the artificial uses of `__nv_XXX` functions that we used to keep the
> >>>>>> definitions around in order to remove them from the final result.
> >>>>>>
> >>>>> This is the 'link early' approach, I should've been explicit that it's
> >>>>> 'link early *everything*' as opposed to linking only what's needed at
> >>>>> the beginning.
> >>>>> It would work at the expense of having to process/optimize 500KB worth
> >>>>> of bitcode for every compilation, whether it needs it or not.
> >>>>>
> >>>>>
> >>>>>> We get the benefit of having `llvm.cos` for some of the pipeline,
> >>>>>> we know it does not have all the bad effects while `__nv_cos` is defined
> >>>>>> with inline assembly. We also get the benefit of inlining `__nv_cos`
> >>>>>> and folding the implementation based on the arguments. Finally,
> >>>>>> this should work with the existing pipeline, the linking is the same
> >>>>>> as before, all we do is to keep the definitions alive longer and
> >>>>>> lower `llvm.cos` to `__nv_cos` in a middle end pass.
> >>>>>>
> >>>>> Again, I agree that it is doable.
> >>>>>
> >>>>>
> >>>>>
> >>>>>> This might be similar to the PTX solution you describe below but I feel
> >>>>>> we get the inline benefit from this without actually changing the
> >>>>>> pipeline at all.
> >>>>>>
> >>>>> So, to summarize:
> >>>>> * link the library as bitcode early, add artificial placeholders for
> >>>>> everything, compile, remove placeholders and DCE unused stuff away.
> >>>>>      Pros:
> >>>>>         - we're already doing most of it before clang hands off IR to
> >>>>>           LLVM, so it just pushes it a bit lower in the compilation.
> >>>>>      Cons:
> >>>>>         - runtime cost of optimizing libdevice bitcode,
> >>>>>         - libdevice may be required for all NVPTX compilations?
> >>>>>
> >>>>> * link the library as bitcode late.
> >>>>>      Pros:
> >>>>>         - lower runtime cost than link-early approach.
> >>>>>      Cons:
> >>>>>         - We'll need to make sure that NVVMReflect pass processes the
> >>>>>           library.
> >>>>>         - fewer optimizations on the library functions. Some of the code
> >>>>>           gets DCE'ed away after NVVMReflect and the rest could be
> >>>>>           optimized better.
> >>>>>         - libdevice may be required for all NVPTX compilations?
> >>>>>
> >>>>> * 'link' with the library as PTX appended as text to LLVM's output and
> >>>>> let ptxas do the 'linking'
> >>>>>      Pros: LLVM remains agnostic of CUDA SDK installation details. All
> >>>>>      it does is allow lowering libcalls and leave their resolution to
> >>>>>      the external tools.
> >>>>>      Cons: Need to have the PTX library somewhere and need to integrate
> >>>>>      the 'linking' into the compilation process somehow.
> >>>>>
> >>>>> Neither is particularly good. If the runtime overhead of link-early is
> >>>>> acceptable, then it may be a winner here, by a very small margin.
> >>>>> link-as-PTX may be better conceptually as it keeps linking and
> >>>>> compilation separate.
> >>>>>
> >>>>> As for the practical steps, here's what we need:
> >>>>> - allow libcall lowering in NVPTX, possibly guarded by a flag. This is
> >>>>> needed for all of the approaches above.
> >>>>> - teach LLVM how to link in bitcode (and, possibly, control early/late
> >>>>> mode)
> >>>>> - teach clang driver to delegate libdevice linking to LLVM.
> >>>>>
> >>>>> This will allow us to experiment with all three approaches and see what
> >>>>> works best.
> >>>> I think if we embed knowledge about the nv_XXX functions we can
> >>>> even get away without the cons you listed for early linking above.
> >>>>
> >>> WDYM by `embed knowledge about the nv_XXX functions`? By linking those
> >>> functions in? Or do you mean that we should just declare them
> >>> before/instead of linking libdevice in?
> >> I mean by providing the "libcall lowering" pass. So the knowledge
> >> that llvm.cos maps to __nv_cos.
> >>
> >>>
> >>>> For early link I'm assuming an order similar to [0] but I also discuss
> >>>> the case where we don't link libdevice early for a TU.
> >>>>
> >>> That link just describes the steps needed to use libdevice. It does not
> >>> deal with how/where it fits in the LLVM pipeline.
> >>> The gist is that NVVMReflect replaces some conditionals with constants.
> >>> libdevice uses that as a poor man's IR preprocessor, conditionally
> >>> enabling different implementations and relying on DCE and constant
> >>> folding to remove unused parts and eliminate the now useless branches.
> >>> While running NVVM alone will make libdevice code valid and usable, it
> >>> would still benefit from further optimizations. I do not know to what
> >>> degree, though.
> >>>
> >>>
> >>>> Link early:
> >>>> 1) clang emits module.bc and links in libdevice.bc but with the
> >>>>       `optnone`, `noinline`, and "used" attribute for functions in
> >>>>       libdevice. ("used" is not an attribute but could as well be.)
> >>>>       At this stage module.bc might call __nv_XXX or llvm.XXX freely
> >>>>       as defined by -ffast-math and friends.
> >>>>
> >>> That could work. Just carrying extra IR around would probably be OK.
> >>> We may want to do NVVMReflect as soon as we have it linked in and, maybe,
> >>> allow optimizing the functions that are explicitly used already.
> >> Right. NVVMReflect can be run twice and with `alwaysinline`
> >> on the call sites of __nv_XXX functions we will actually
> >> inline and optimize them while the definitions are just "dragged
> >> along" in case we need them later.
> >>
> >>
> >>>> 2) Run some optimizations in the middle end, maybe till the end of
> >>>>       the inliner loop, unsure.
> >>>> 3) Run a libcall lowering pass and another NVVMReflect pass (or the
> >>>>       only instance thereof). We effectively remove all llvm.XXX calls
> >>>>       in favor of __nv_XXX now. Note that we haven't spent (much) time
> >>>>       on the libdevice code as it is optnone and most passes are good
> >>>>       at skipping those. To me, it's unclear if the used parts should
> >>>>       not be optimized before we inline them anyway to avoid redoing
> >>>>       the optimizations over and over (per call site). That needs
> >>>>       measuring I guess. Also note that we can still retain the current
> >>>>       behavior for direct calls to __nv_XXX if we mark the call sites
> >>>>       as `alwaysinline`, or at least the behavior is almost like the
> >>>>       current one is.
> >>>> 4) Run an always inliner pass on the __nv_XXX calls because it is
> >>>>       something we would do right now. Alternatively, remove `optnone`
> >>>>       and `noinline` from the __nv_XXX calls.
> >>>> 5) Continue with the pipeline as before.
> >>>>
> >>>>
> >>> SGTM.
> >>>
> >>>
> >>>> As mentioned above, `optnone` avoids spending time on the libdevice
> >>>> until we "activate" it. At that point (globals) DCE can be scheduled
> >>>> to remove all unused parts right away. I don't think this is (much)
> >>>> more expensive than linking libdevice early right now.
> >>>>
> >>>> Link late, aka. translation units without libdevice:
> >>>> 1) clang emits module.bc but does not link in libdevice.bc, it will be
> >>>>       made available later. We still can mix __nv_XXX and llvm.XXX calls
> >>>>       freely as above.
> >>>> 2) Same as above.
> >>>> 3) Same as above.
> >>>> 4) Same as above but effectively a no-op, no __nv_XXX definitions are
> >>>>       available.
> >>>> 5) Same as above.
> >>>>
> >>>>
> >>>> I might misunderstand something about the current pipeline but from [0]
> >>>> and the experiments I run locally it looks like the above should cover
> >>>> all the cases. WDYT?
> >>>>
> >>>>
> >>> The `optnone` trick may indeed remove much of the practical differences
> >>> between the early/late approaches.
> >>> In principle it should work.
> >>>
> >>> Next question is -- is libdevice sufficient to satisfy LLVM's assumptions
> >>> about the standard library.
> >>> While it does provide most of the equivalents of libm functions, the set
> >>> is not complete and some of the functions differ from their libm
> >>> counterparts.
> >>> The differences are minor, so we should be able to deal with it by
> >>> generating few wrapper functions for the odd cases.
> >>> Here's what clang does to provide math functions using libdevice:
> >>>
> >>> https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h
> >>
> >> Right now, clang will generate any llvm intrinsic and we crash, so anything
> >> else is probably a step in the right direction. Eventually, we should
> >> "lower" all intrinsics that the NVPTX backend can't handle or at least
> >> emit a nice error message. Preferably, clang would know what we can't deal
> >> with and not generate intrinsic calls for those in the first place.
> >>
> >>
> >>> The most concerning aspect of libdevice is that we don't know when we'll
> >>> no longer be able to use the libdevice bitcode? My understanding is that
> >>> IR does not guarantee binary stability and at some point we may just be
> >>> unable to use it. Ideally we need our own libm for GPUs.
> >> For OpenMP I did my best to avoid writing libm (code) for GPUs by piggy
> >> backing on CUDA and libc++ implementations, I hope it will stay that way.
> >> That said, if the need arises we might really have to port libc++ to the
> >> GPUs.
> >>
> >> Back to the problem with libdevice. I agree that the solution of NVIDIA
> >> to ship a .bc library is suboptimal but with the existing, or an extended,
> >> auto-upgrader we might be able to make that work reasonably well for the
> >> foreseeable future. That problem is orthogonal to what we are discussing
> >> above, I think.
> >>
> >> ~ Johannes
> >>
> >>
> >>> --Artem
> >>>
> >>>
> >>>> ~ Johannes
> >>>>
> >>>>
> >>>> P.S. If the rewrite capability (aka libcall lowering) is generic we
> >>>>         could use the scheme for many other things as well.
> >>>>
> >>>>
> >>>> [0] https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice
> >>>>
> >>>>
> >>>>> --Artem
> >>>>>
> >>>>>
> >>>>>> ~ Johannes
> >>>>>>
> >>>>>>
> >>>>>>> One thing that may work within the existing compilation model is to
> >>>>>>> pre-compile the standard library into PTX and then textually embed
> >>>>>>> relevant functions into the generated PTX, thus pushing the 'linking'
> >>>>>>> phase past the end of LLVM's compilation and make it look closer to the
> >>>>>>> standard compile/link process. This way we'd only enable libcall
> >>>>>>> lowering in NVPTX, assuming that the library functions will be magically
> >>>>>>> available out there. Injection of PTX could be done with an external
> >>>>>>> script outside of LLVM and it could be incorporated into clang driver.
> >>>>>>> Bonus points for the fact that this scheme is compatible with -fgpu-rdc
> >>>>>>> out of the box -- assemble the PTX with `ptxas -rdc` and then actually
> >>>>>>> link with the library, instead of injecting its PTX before invoking
> >>>>>>> ptxas.
> >>>>>>>
> >>>>>>> --Artem
> >>>>>>>
> >>>>>>> Trying to figure out a good way to have the cake and eat it too.
> >>>>>>>> ~ Johannes
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 3/10/21 2:49 PM, William Moses wrote:
> >>>>>>>>> Since clang (and arguably any other frontend that uses) should link in
> >>>>>>>>> libdevice, could we lower these intrinsics to the libdevice code?
> >>>>>>> The linking happens *before* LLVM gets to work on IR.
> >>>>>>> As I said, it's a workaround, not the solution. It's possible for LLVM
> >>>>>>> to still attempt lowering something in the IR into a libcall and we
> >>>>>>> would not be able to deal with that. It happens to work well enough in
> >>>>>>> practice.
> >>>>>>> Do you have an example where you see the problem with -ffast-math?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>> For example, consider compiling the simple device function below:
> >>>>>>>>>
> >>>>>>>>> ```
> >>>>>>>>> // /mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S -emit-llvm
> >>>>>>>>> //   --cuda-path=/usr/local/cuda-11.0 -L/usr/local/cuda-11.0/lib64
> >>>>>>>>> //   --cuda-gpu-arch=sm_37
> >>>>>>>>> __device__ double f(double x) {
> >>>>>>>>>     return cos(x);
> >>>>>>>>> }
> >>>>>>>>> ```
> >>>>>>>>>
> >>>>>>>>> The LLVM module for it is as follows:
> >>>>>>>>>
> >>>>>>>>> ```
> >>>>>>>>> ...
> >>>>>>>>> define dso_local double @_Z1fd(double %x) #0 {
> >>>>>>>>> entry:
> >>>>>>>>>   %__a.addr.i = alloca double, align 8
> >>>>>>>>>   %x.addr = alloca double, align 8
> >>>>>>>>>   store double %x, double* %x.addr, align 8
> >>>>>>>>>   %0 = load double, double* %x.addr, align 8
> >>>>>>>>>   store double %0, double* %__a.addr.i, align 8
> >>>>>>>>>   %1 = load double, double* %__a.addr.i, align 8
> >>>>>>>>>   %call.i = call contract double @__nv_cos(double %1) #7
> >>>>>>>>>   ret double %call.i
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> define internal double @__nv_cos(double %a) #1 {
> >>>>>>>>>   %q.i = alloca i32, align 4
> >>>>>>>>> ```
> >>>>>>>>>
> >>>>>>>>> Obviously we would need to do something to ensure these functions don't
> >>>>>>>>> get deleted prior to their use in lowering from intrinsic to libdevice.
> >>>>>>>>> ...
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <tra at google.com> wrote:
> >>>>>>>>>> On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert <
> >>>>>>>>>> johannesdoerfert at gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Artem, Justin,
> >>>>>>>>>>>
> >>>>>>>>>>> I am running into a problem and I'm curious if I'm missing something
> >>>>>>>>>>> or if the support is simply missing.
> >>>>>>>>>>> Am I correct to assume the NVPTX backend does not deal with `llvm.sin`
> >>>>>>>>>>> and friends?
> >>>>>>>>>>>
> >>>>>>>>>> Correct. It can't deal with anything that may need to lower to a
> >>>>>>>>>> standard library call.
> >>>>>>>>>>
> >>>>>>>>>>> This is what I see, with some variations: https://godbolt.org/z/PxsEWs
> >>>>>>>>>>> If this is missing in the backend, is there a plan to get this working,
> >>>>>>>>>>> I'd really like to have the
> >>>>>>>>>>> intrinsics in the middle end rather than __nv_cos, not to mention that
> >>>>>>>>>>> -ffast-math does emit intrinsics
> >>>>>>>>>>> and crashes.
> >>>>>>>>>>>
> >>>>>>>>>> It all boils down to the fact that PTX does not have the standard
> >>>>>>>>>> libc/libm which LLVM could lower the calls to, nor does it have a
> >>>>>>>>>> 'linking' phase where we could link such a library in, if we had it.
> >>>>>>>>>>
> >>>>>>>>>> Libdevice bitcode does provide the implementations for some of the
> >>>>>>>>>> functions (though with a __nv_ prefix) and clang links it in in order
> >>>>>>>>>> to avoid generating IR that LLVM can't handle, but that's a workaround
> >>>>>>>>>> that does not help LLVM itself.
> >>>>>>>>>>
> >>>>>>>>>> --Artem
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> ~ Johannes
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> ───────────────────
> >>>>>>>>>>> ∽ Johannes (he/his)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> --Artem Belevich
> >>>>>>>>>>
>

-- 
--Artem Belevich

Johannes Doerfert via llvm-dev

2021-Mar-12 00:10 UTC

[llvm-dev] NVPTX codegen for llvm.sin (and friends)

On 3/11/21 1:37 PM, Artem Belevich wrote:
> On Thu, Mar 11, 2021 at 10:54 AM Johannes Doerfert <
> johannesdoerfert at gmail.com> wrote:
>
>> I certainly agree we should try to avoid a hard-coded mapping
>> in C++.
>>
>> I could see something like:
>>
>> ```
>> __attribute__((implementation("llvm.cos")))
>> double __nv_cos(...) { ... }
>>
>> ```
>>
>> and a pass that transforms all calls to a function with an
>> "implementation" to calls to that implementation. Maybe
>> later we attach a score/priority ;)
>>
>>
> I'm not sure how that would work.
> Where would you place that `__attribute__((implementation))` ? We do not
> have the definitions for `__nv_*` as they come from NVIDIA-provided
> bitcode. We could add the attribute to the declaration in
> `__clang_cuda_libdevice_declares.h`.
> How does LLVM handle the differences in function attributes between
> function declaration and definition? Will there be trouble when we link in
> the actual __nv_cos from the libdevice that would not have that attribute?
>
> Another potential gotcha is that for the functions that can't be directly
> mapped 1:1 to `__nv_*` counterparts, we'd still need to provide the
> implementation ourselves. We will not know whether the implementation will
> be used until after the substitution pass, so we'll need to make sure it's
> not DCE'd until then. It appears to be the same issue (though on a smaller
> scale) as with linking in libdevice directly.
>
> Let's take a step back and figure out which issues we want to solve.
>
> The top-level goal is to provide implementations for LLVM intrinsics. For
> now let's stick with the libm-related ones.
> What we have is the libdevice bitcode, which uses different function names
> and provides a subset of the functionality we need.
> What we are missing is
>    - something to connect LLVM's libcalls to the GPU-side implementation,
>    - additional code to provide implementations for the functions that are
>      missing or different in libdevice.
>
> Considering that we want this to work in LLVM, the additional code would
> have to be bitcode, and it would have to exist in addition to libdevice.
> Our options for the mapping between LLVM intrinsics and the implementation
> are
> * an intrinsic -> __nv_* equivalent mapping pass.
>     This would still need additional bitcode for the missing/different
>     functions.
> * lower libcalls to the standard libm APIs, and implement the libm -> __nv_*
>     mapping in our own bitcode.
>
> Considering that additional bitcode is needed in both cases, I believe that
> the second approach makes more sense.
I really hope to avoid any additional bitcode; there are too many
drawbacks and basically no benefits, IMHO.

> LLVM does not need to know or care about what's provided by libdevice, and
> we'd have more flexibility, compared to what we could do in the mapping
> pass. It also makes it easy to substitute a different implementation, if we
> have or need one.
I agree that LLVM (core) should not know about __nv_*, that's why I suggested
the `__attribute__((implements("...")))` approach. My preferred solution
is still to annotate our declarations of __nv_* and point to the
llvm.intrinsics (name) from there. If we have a missing mapping, we point to an
intrinsic from a definition that lives in the Clang headers next to the
__nv_* declarations.

This does not yet work because -mlink-builtin-bitcode (which I assume
triggers the llvm-link logic) will drop the attributes of a declaration
if a definition is found. I think that should not be the case anyway;
the linked function should carry the union of the attributes.
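
To spell out the difference between the two linking behaviours, here is a
small C++ sketch. It is a toy model only: attribute sets are plain strings,
`AttrSet`, `linkDroppingDeclAttrs` and `linkWithUnion` are hypothetical names
invented for this illustration, and nothing here is LLVM's actual linker or
attribute API. A real implementation would also have to decide what to do
with attributes that conflict.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model: an attribute set is just a set of strings.
using AttrSet = std::set<std::string>;

// Behaviour described above: once a definition is found, the attributes of
// the declaration are dropped and only the definition's attributes survive.
inline AttrSet linkDroppingDeclAttrs(const AttrSet &decl, const AttrSet &def) {
  (void)decl;  // intentionally ignored in this model
  return def;
}

// Suggested behaviour: the linked function carries the union of both sets.
inline AttrSet linkWithUnion(const AttrSet &decl, const AttrSet &def) {
  AttrSet merged = def;
  merged.insert(decl.begin(), decl.end());
  return merged;
}

// Usage: an annotation placed on our declaration survives only in the
// union model.
inline void check_link_models() {
  AttrSet decl = {"implements=llvm.cos"};
  AttrSet def = {"nounwind"};
  assert(linkDroppingDeclAttrs(decl, def).count("implements=llvm.cos") == 0);
  assert(linkWithUnion(decl, def).count("implements=llvm.cos") == 1);
}
```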

The benefit I see for the above is that the mapping is tied to the
declarations and doesn't live in a tablegen file far away. It works well
even if we can't map 1:1, and we could even restrict the "used" attribute
to anything that has an "implements" attribute. So:

```
__nv_A() { ... } // called, inlined and optimized as before, DCE'ed after.

__nv_B() { ... } // not called, DCE'ed.

__attribute__((implements("llvm.C")))
__nv_C() { ... } // calls are inlined and optimized as before, not DCE'ed
                  // though because of the attribute. Replaces llvm.C as
                  // callee in the special pass.
```

So "implements" gives you a way to statically replace a function declaration
or definition with another one. I could see it being used to provide other
intrinsics to platforms with backends that don't support them.
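
The replacement step could be sketched roughly as follows. This is a toy
model in C++, not an LLVM pass: `ToyFunction`, `ToyModule`, `lowerImplements`
and `dropUnused` are hypothetical names made up for the illustration, a call
site is reduced to the name of its callee, and the "implements" attribute is
modelled as a plain string. It assumes at most one function implements a
given intrinsic (no score/priority).

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy stand-ins for IR entities; these are NOT LLVM classes.
struct ToyFunction {
  std::string name;
  std::string implements;  // empty if the function has no such attribute
};

struct ToyModule {
  std::vector<ToyFunction> functions;  // available definitions
  std::vector<std::string> callees;    // one entry per call site
};

// Redirect every call whose callee is named by some function's "implements"
// attribute to that function. Returns the number of rewritten call sites.
inline int lowerImplements(ToyModule &m) {
  std::map<std::string, std::string> impl;  // intrinsic -> implementation
  for (const ToyFunction &f : m.functions) {
    if (f.implements.empty()) continue;
    bool inserted = impl.emplace(f.implements, f.name).second;
    assert(inserted && "two functions implement the same intrinsic");
    (void)inserted;
  }
  int rewritten = 0;
  for (std::string &callee : m.callees) {
    auto it = impl.find(callee);
    if (it != impl.end()) {
      callee = it->second;
      ++rewritten;
    }
  }
  return rewritten;
}

// After lowering, the attribute no longer needs to keep anything alive:
// drop every definition that no call site references.
inline void dropUnused(ToyModule &m) {
  auto unused = [&m](const ToyFunction &f) {
    return std::find(m.callees.begin(), m.callees.end(), f.name) ==
           m.callees.end();
  };
  m.functions.erase(
      std::remove_if(m.functions.begin(), m.functions.end(), unused),
      m.functions.end());
}
```

With definitions `__nv_A`, `__nv_B` and `__nv_C` (the last one implementing
`llvm.C`) and call sites to `__nv_A` and `llvm.C`, running `lowerImplements`
followed by `dropUnused` in this model redirects the `llvm.C` call to `__nv_C`
and leaves only `__nv_A` and `__nv_C` behind.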

Does that make some sense?

~ Johannes

>
> WDYT?
>
> --Artem
>
>
>>
>> On 3/10/21 8:44 PM, William Moses wrote:
>>> We could also consider doing something slightly broader.
>>>
>>> For example we could define a special attribute on top of the llvm.cos
>>> call/declaration etc with metadata or an attribute that points to the
>>> actual __nv_cos function. Then in a subsequent lowering pass the
>>> corresponding intrinsic with the relevant attribute has its uses replaced
>>> by the actual function.
>>>
>>>
>>> On Wed, Mar 10, 2021 at 7:57 PM Johannes Doerfert <
>>> johannesdoerfert at gmail.com> wrote:
>>>
>>>> On 3/10/21 6:22 PM, Artem Belevich wrote:
>>>>> On Wed, Mar 10, 2021 at 3:44 PM Johannes Doerfert <
>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>
>>>>>> On 3/10/21 4:38 PM, Artem Belevich wrote:
>>>>>>> On Wed, Mar 10, 2021 at 1:55 PM Johannes Doerfert <
>>>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>>>
>>>>>>>> On 3/10/21 3:25 PM, Artem Belevich wrote:
>>>>>>>>> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <
>>>>>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Right. We could keep the definition of __nv_cos and friends
>>>>>>>>>> around. Right now, -ffast-math might just crash on the user,
>>>>>>>>>> which is arguably a bad thing. I can also see us benefiting
>>>>>>>>>> in various other ways from llvm.cos uses instead of __nv_cos
>>>>>>>>>> (assuming precision is according to the user requirements but
>>>>>>>>>> that is always a condition).
>>>>>>>>>>
>>>>>>>>>> It could be as simple as introducing __nv_cos into
>>>>>>>>>> "llvm.used" and a backend matching/rewrite pass.
>>>>>>>>>>
>>>>>>>>>> If the backend knew the libdevice location it could even pick
>>>>>>>>>> the definitions from there. Maybe we could link libdevice late
>>>>>>>>>> instead of eager?
>>>>>>>>>>
>>>>>>>>> It's possible, but it would require plumbing in CUDA SDK awareness into
>>>>>>>>> LLVM. While clang driver can deal with that, LLVM currently can't. The
>>>>>>>>> bitcode library path would have to be provided by the user.
>>>>>>>> The PTX backend could arguably be CUDA SDK aware, IMHO, it would
>>>>>>>> even be fine if the middle-end does the remapping to get inlining
>>>>>>>> and folding benefits also after __nv_cos is used. See below.
>>>>>>>>
>>>>>>>>
>>>>>>>>> The standard library as bitcode raises some questions.
>>>>>>>> Which standard library? CUDA's libdevice is a bitcode library, right?
>>>>>>>>
>>>>>>> It's whatever LLVM will need to lower libcalls to. libdevice bitcode is
>>>>>>> the closest approximation of that we have at the moment.
>>>>>>>
>>>>>>>
>>>>>>>>> * When do we want to do the linking? If we do it at the beginning, then
>>>>>>>>> the question is how to make sure unused functions are not eliminated
>>>>>>>>> before we may need them, as we don't know a priori what's going to be
>>>>>>>>> needed. We also do want the unused functions to be gone after we're
>>>>>>>>> done. Linking it in early would allow optimizing the code better at the
>>>>>>>>> expense of having to optimize a lot of code we'll throw away. Linking it
>>>>>>>>> in late has less overhead, but leaves the linked in bitcode unoptimized,
>>>>>>>>> though it's probably in the ballpark of what would happen with a real
>>>>>>>>> library call. I.e. no inlining, etc.
>>>>>>>>>
>>>>>>>>> * It incorporates linking into LLVM, which is not LLVM's job. Arguably,
>>>>>>>>> the line should be drawn at the lowering to libcalls as it's done for
>>>>>>>>> other back-ends. However, we're also constrained by the need to have
>>>>>>>>> the linking done before we generate PTX which prevents doing it after
>>>>>>>>> LLVM is done generating an object file.
>>>>>>>> I'm confused. Clang links in libdevice.bc early.
>>>>>>> Yes. Because that's where it has to happen if we want to keep LLVM
>>>>>>> unaware of CUDA SDK.
>>>>>>> It does not have to be the case if/when LLVM can do the linking itself.
>>>>>>>
>>>>>>>> If we make sure
>>>>>>>> `__nv_cos` is not deleted early, we can at any point "lower" `llvm.cos`
>>>>>>>> to `__nv_cos` which is available. After the lowering we can remove
>>>>>>>> the artificial uses of `__nv_XXX` functions that we used to keep the
>>>>>>>> definitions around in order to remove them from the final result.
>>>>>>>>
>>>>>>> This is the 'link early' approach, I should've been explicit that it's
>>>>>>> 'link early *everything*' as opposed to linking only what's needed at
>>>>>>> the beginning.
>>>>>>> It would work at the expense of having to process/optimize 500KB worth
>>>>>>> of bitcode for every compilation, whether it needs it or not.
>>>>>>>
>>>>>>>
>>>>>>>> We get the benefit of having `llvm.cos` for some of the pipeline,
>>>>>>>> we know it does not have all the bad effects while `__nv_cos` is defined
>>>>>>>> with inline assembly. We also get the benefit of inlining `__nv_cos`
>>>>>>>> and folding the implementation based on the arguments. Finally,
>>>>>>>> this should work with the existing pipeline, the linking is the same
>>>>>>>> as before, all we do is to keep the definitions alive longer and
>>>>>>>> lower `llvm.cos` to `__nv_cos` in a middle end pass.
>>>>>>>>
>>>>>>> Again, I agree that it is doable.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> This might be similar to the PTX solution you describe below but I feel
>>>>>>>> we get the inline benefit from this without actually changing the pipeline
>>>>>>>> at all.
>>>>>>>>
>>>>>>> So, to summarize:
>>>>>>> * link the library as bitcode early, add artificial placeholders for
>>>>>>> everything, compile, remove placeholders and DCE unused stuff away.
>>>>>>>       Pros:
>>>>>>>          - we're already doing most of it before clang hands off IR to
>>>>>>>            LLVM, so it just pushes it a bit lower in the compilation.
>>>>>>>       Cons:
>>>>>>>          - runtime cost of optimizing libdevice bitcode,
>>>>>>>          - libdevice may be required for all NVPTX compilations?
>>>>>>>
>>>>>>> * link the library as bitcode late.
>>>>>>>       Pros:
>>>>>>>          - lower runtime cost than link-early approach.
>>>>>>>       Cons:
>>>>>>>          - We'll need to make sure that NVVMReflect pass processes the
>>>>>>>            library.
>>>>>>>          - fewer optimizations on the library functions. Some of the code
>>>>>>>            gets DCE'ed away after NVVMReflect and the rest could be
>>>>>>>            optimized better.
>>>>>>>          - libdevice may be required for all NVPTX compilations?
>>>>>>>
>>>>>>> * 'link' with the library as PTX appended as text to LLVM's output and let
>>>>>>> ptxas do the 'linking'
>>>>>>>       Pros: LLVM remains agnostic of CUDA SDK installation details. All it
>>>>>>> does is allow lowering libcalls and leaves their resolution to the
>>>>>>> external tools.
>>>>>>>       Cons: Need to have the PTX library somewhere and need to integrate the
>>>>>>> 'linking' into the compilation process somehow.
>>>>>>>
>>>>>>> Neither is particularly good. If the runtime overhead of link-early is
>>>>>>> acceptable, then it may be a winner here, by a very small margin.
>>>>>>> link-as-PTX may be better conceptually as it keeps linking and compilation
>>>>>>> separate.
>>>>>>>
>>>>>>> As for the practical steps, here's what we need:
>>>>>>> - allow libcall lowering in NVPTX, possibly guarded by a flag. This is
>>>>>>> needed for all of the approaches above.
>>>>>>> - teach LLVM how to link in bitcode (and, possibly, control early/late mode)
>>>>>>> - teach clang driver to delegate libdevice linking to LLVM.
>>>>>>>
>>>>>>> This will allow us to experiment with all three approaches and see what
>>>>>>> works best.
>>>>>> I think if we embed knowledge about the nv_XXX functions we can
>>>>>> even get away without the cons you listed for early linking above.
>>>>>>
>>>>> WDYM by `embed knowledge about the nv_XXX functions`? By linking those
>>>>> functions in? Or do you mean that we should just declare them
>>>>> before/instead of linking libdevice in?
>>>> I mean by providing the "libcall lowering" pass. So the knowledge
>>>> that llvm.cos maps to __nv_cos.
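The "knowledge that llvm.cos maps to __nv_cos" could be captured as a plain lookup table. The following is a minimal sketch, not LLVM code: `kLibdeviceMap` and `lookupLibdevice` are hypothetical names, and only a handful of entries are listed for illustration.

```cpp
#include <array>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical table: LLVM math intrinsic name -> libdevice symbol.
// Illustrative subset only; this is not an existing LLVM data structure.
constexpr std::array<std::pair<std::string_view, std::string_view>, 6>
    kLibdeviceMap{{
        {"llvm.cos.f64", "__nv_cos"},
        {"llvm.cos.f32", "__nv_cosf"},
        {"llvm.sin.f64", "__nv_sin"},
        {"llvm.sin.f32", "__nv_sinf"},
        {"llvm.exp.f64", "__nv_exp"},
        {"llvm.exp.f32", "__nv_expf"},
    }};

// Returns the libdevice replacement for an intrinsic, or nullopt when the
// intrinsic has no entry (such cases would need a wrapper or an error).
std::optional<std::string_view> lookupLibdevice(std::string_view intrinsic) {
  for (const auto &[from, to] : kLibdeviceMap)
    if (from == intrinsic)
      return to;
  return std::nullopt;
}
```

A lowering pass built on such a table would rewrite a call only when the lookup succeeds and leave everything else untouched.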
>>>>
>>>>>> For early link I'm assuming an order similar to [0] but I also discuss
>>>>>> the case where we don't link libdevice early for a TU.
>>>>>>
>>>>> That link just describes the steps needed to use libdevice. It does not
>>>>> deal with how/where it fits in the LLVM pipeline.
>>>>> The gist is that NVVMReflect replaces some conditionals with constants.
>>>>> libdevice uses that as a poor man's IR preprocessor, conditionally enabling
>>>>> different implementations and relying on DCE and constant folding to remove
>>>>> unused parts and eliminate the now useless branches.
>>>>> While running NVVMReflect alone will make libdevice code valid and usable, it
>>>>> would still benefit from further optimizations. I do not know to what
>>>>> degree, though.
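The "poor man's IR preprocessor" idea can be mimicked on the host. This is only a sketch under stated assumptions: `reflect` is a hypothetical stand-in for the constant that NVVMReflect substitutes for a `__nvvm_reflect("__CUDA_FTZ")` query, and `flushDenormal` and `scaleBy2` are made-up functions, not libdevice code.

```cpp
#include <string_view>

// Hypothetical stand-in for the value NVVMReflect would substitute.
// Here the flush-to-zero setting is hard-wired to "on".
constexpr int reflect(std::string_view key) {
  return key == "__CUDA_FTZ" ? 1 : 0;
}

// Made-up helper: treat single-precision denormals as zero.
inline float flushDenormal(float x) {
  return (x > -1.17549435e-38f && x < 1.17549435e-38f) ? 0.0f : x;
}

// Two implementations guarded by the reflected constant. Once the condition
// is a known constant, constant folding and DCE can drop the unused branch.
inline float scaleBy2(float x) {
  if (reflect("__CUDA_FTZ")) {
    return flushDenormal(x) * 2.0f;
  }
  return x * 2.0f;
}
```

Until the query is replaced by a constant, both branches have to be carried along, which is why the pass needs to run on the library code before it is usable.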
>>>>>
>>>>>
>>>>>> Link early:
>>>>>> 1) clang emits module.bc and links in libdevice.bc but with the
>>>>>>        `optnone`, `noinline`, and "used" attribute for functions in
>>>>>>        libdevice. ("used" is not an attribute but could as well be.)
>>>>>>        At this stage module.bc might call __nv_XXX or llvm.XXX freely
>>>>>>        as defined by -ffast-math and friends.
>>>>>>
>>>>> That could work. Just carrying extra IR around would probably be OK.
>>>>> We may want to do NVVMReflect as soon as we have it linked in and, maybe,
>>>>> allow optimizing the functions that are explicitly used already.
>>>> Right. NVVMReflect can be run twice and with `alwaysinline`
>>>> on the call sites of __nv_XXX functions we will actually
>>>> inline and optimize them while the definitions are just "dragged
>>>> along" in case we need them later.
>>>>
>>>>
>>>>>> 2) Run some optimizations in the middle end, maybe till the end of
>>>>>>        the inliner loop, unsure.
>>>>>> 3) Run a libcall lowering pass and another NVVMReflect pass (or the
>>>>>>        only instance thereof). We effectively remove all llvm.XXX calls
>>>>>>        in favor of __nv_XXX now. Note that we haven't spent (much) time
>>>>>>        on the libdevice code as it is optnone and most passes are good
>>>>>>        at skipping those. To me, it's unclear if the used parts should
>>>>>>        not be optimized before we inline them anyway to avoid redoing
>>>>>>        the optimizations over and over (per call site). That needs
>>>>>>        measuring I guess. Also note that we can still retain the current
>>>>>>        behavior for direct calls to __nv_XXX if we mark the call sites
>>>>>>        as `alwaysinline`, or at least the behavior is almost like the
>>>>>>        current one.
>>>>>> 4) Run an always inliner pass on the __nv_XXX calls because it is
>>>>>>        something we would do right now. Alternatively, remove `optnone`
>>>>>>        and `noinline` from the __nv_XXX calls.
>>>>>> 5) Continue with the pipeline as before.
>>>>>>
>>>>>>
>>>>> SGTM.
>>>>>
>>>>>
>>>>>> As mentioned above, `optnone` avoids spending time on the libdevice
>>>>>> until we "activate" it. At that point (globals) DCE can be scheduled
>>>>>> to remove all unused parts right away. I don't think this is (much)
>>>>>> more expensive than linking libdevice early right now.
>>>>>>
>>>>>> Link late, aka. translation units without libdevice:
>>>>>> 1) clang emits module.bc but does not link in libdevice.bc, it will be
>>>>>>        made available later. We still can mix __nv_XXX and llvm.XXX calls
>>>>>>        freely as above.
>>>>>> 2) Same as above.
>>>>>> 3) Same as above.
>>>>>> 4) Same as above but effectively a no-op, no __nv_XXX definitions are
>>>>>>        available.
>>>>>> 5) Same as above.
>>>>>>
>>>>>>
>>>>>> I might misunderstand something about the current pipeline but from [0]
>>>>>> and the experiments I ran locally it looks like the above should cover all
>>>>>> the cases. WDYT?
>>>>>>
>>>>>>
>>>>> The `optnone` trick may indeed remove much of the practical differences
>>>>> between the early/late approaches.
>>>>> In principle it should work.
>>>>>
>>>>> Next question is -- is libdevice sufficient to satisfy LLVM's assumptions
>>>>> about the standard library?
>>>>> While it does provide most of the equivalents of libm functions, the set is
>>>>> not complete and some of the functions differ from their libm counterparts.
>>>>> The differences are minor, so we should be able to deal with it by
>>>>> generating a few wrapper functions for the odd cases.
>>>>> Here's what clang does to provide math functions using libdevice:
>>>>>
>>>>> https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h
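A wrapper for one of the "odd cases" might look like the sketch below. The choice of `scalbln` is an assumption made for illustration: it supposes the device library only offers a primitive taking an `int` exponent. `device_scalbn` and `wrapped_scalbln` are hypothetical names, and `std::scalbn` stands in for the device primitive so the example runs on the host.

```cpp
#include <climits>
#include <cmath>

// Hypothetical stand-in for a device primitive with an int exponent.
inline double device_scalbn(double x, int n) { return std::scalbn(x, n); }

// Wrapper providing the libm-style signature with a long exponent.
// Exponents outside the int range are saturated: at those magnitudes the
// result has already overflowed to infinity or underflowed to zero, so
// clamping does not change the returned value.
inline double wrapped_scalbln(double x, long n) {
  if (n > INT_MAX)
    n = INT_MAX;
  if (n < INT_MIN)
    n = INT_MIN;
  return device_scalbn(x, static_cast<int>(n));
}
```

The point is only that a thin adapter can bridge a signature mismatch; whether a given function needs one depends on what the library actually exports.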
>>>> Right now, clang will generate any llvm intrinsic and we crash, so anything
>>>> else is probably a step in the right direction. Eventually, we should "lower"
>>>> all intrinsics that the NVPTX backend can't handle or at least emit a nice
>>>> error message. Preferably, clang would know what we can't deal with and not
>>>> generate intrinsic calls for those in the first place.
>>>>
>>>>
>>>>> The most concerning aspect of libdevice is that we don't know when we'll no
>>>>> longer be able to use the libdevice bitcode. My understanding is that IR
>>>>> does not guarantee binary stability and at some point we may just be unable
>>>>> to use it. Ideally we need our own libm for GPUs.
>>>> For OpenMP I did my best to avoid writing libm (code) for GPUs by
>>>> piggybacking on CUDA and libc++ implementations, I hope it will stay that way.
>>>> That said, if the need arises we might really have to port libc++ to the
>>>> GPUs.
>>>>
>>>> Back to the problem with libdevice. I agree that the solution of NVIDIA
>>>> to ship a .bc library is suboptimal but with the existing, or an extended,
>>>> auto-upgrader we might be able to make that work reasonably well for the
>>>> foreseeable future. That problem is orthogonal to what we are discussing
>>>> above, I think.
>>>>
>>>> ~ Johannes
>>>>
>>>>
>>>>> --Artem
>>>>>
>>>>>
>>>>>> ~ Johannes
>>>>>>
>>>>>>
>>>>>> P.S. If the rewrite capability (aka libcall lowering) is generic we could
>>>>>>          use the scheme for many other things as well.
>>>>>>
>>>>>>
>>>>>> [0] https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice
>>>>>>
>>>>>>
>>>>>>> --Artem
>>>>>>>
>>>>>>>
>>>>>>>> ~ Johannes
>>>>>>>>
>>>>>>>>
>>>>>>>>> One thing that may work within the existing compilation model is to
>>>>>>>>> pre-compile the standard library into PTX and then textually embed relevant
>>>>>>>>> functions into the generated PTX, thus pushing the 'linking' phase past the
>>>>>>>>> end of LLVM's compilation and making it look closer to the standard
>>>>>>>>> compile/link process. This way we'd only enable libcall lowering in NVPTX,
>>>>>>>>> assuming that the library functions will be magically available out there.
>>>>>>>>> Injection of PTX could be done with an external script outside of LLVM and
>>>>>>>>> it could be incorporated into clang driver. Bonus points for the fact that
>>>>>>>>> this scheme is compatible with -fgpu-rdc out of the box -- assemble the PTX
>>>>>>>>> with `ptxas -rdc` and then actually link with the library, instead of
>>>>>>>>> injecting its PTX before invoking ptxas.
>>>>>>>>>
>>>>>>>>> --Artem
>>>>>>>>>
>>>>>>>>> Trying to figure out a good way to have the cake and eat it too.
>>>>>>>>>> ~ Johannes
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 3/10/21 2:49 PM, William Moses wrote:
>>>>>>>>>>> Since clang (and arguably any other frontend that uses) should link in
>>>>>>>>>>> libdevice, could we lower these intrinsics to the libdevice code?
>>>>>>>>> The linking happens *before* LLVM gets to work on IR.
>>>>>>>>> As I said, it's a workaround, not the solution. It's possible for LLVM to
>>>>>>>>> still attempt lowering something in the IR into a libcall and we would not
>>>>>>>>> be able to deal with that. It happens to work well enough in practice.
>>>>>>>>> Do you have an example where you see the problem with -ffast-math?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> For example, consider compiling the simple device function below:
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>> // /mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S -emit-llvm
>>>>>>>>>>> //   --cuda-path=/usr/local/cuda-11.0 -L/usr/local/cuda-11.0/lib64
>>>>>>>>>>> //   --cuda-gpu-arch=sm_37
>>>>>>>>>>> __device__ double f(double x) {
>>>>>>>>>>>     return cos(x);
>>>>>>>>>>> }
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> The LLVM module for it is as follows:
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>> ...
>>>>>>>>>>> define dso_local double @_Z1fd(double %x) #0 {
>>>>>>>>>>> entry:
>>>>>>>>>>>   %__a.addr.i = alloca double, align 8
>>>>>>>>>>>   %x.addr = alloca double, align 8
>>>>>>>>>>>   store double %x, double* %x.addr, align 8
>>>>>>>>>>>   %0 = load double, double* %x.addr, align 8
>>>>>>>>>>>   store double %0, double* %__a.addr.i, align 8
>>>>>>>>>>>   %1 = load double, double* %__a.addr.i, align 8
>>>>>>>>>>>   %call.i = call contract double @__nv_cos(double %1) #7
>>>>>>>>>>>   ret double %call.i
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> define internal double @__nv_cos(double %a) #1 {
>>>>>>>>>>>   %q.i = alloca i32, align 4
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> Obviously we would need to do something to ensure these functions don't get
>>>>>>>>>>> deleted prior to their use in lowering from intrinsic to libdevice.
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <tra at google.com> wrote:
>>>>>>>>>>>> On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert <
>>>>>>>>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Artem, Justin,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am running into a problem and I'm curious if I'm missing something or
>>>>>>>>>>>>> if the support is simply missing.
>>>>>>>>>>>>> Am I correct to assume the NVPTX backend does not deal with `llvm.sin`
>>>>>>>>>>>>> and friends?
>>>>>>>>>>>>>
>>>>>>>>>>>> Correct. It can't deal with anything that may need to lower to a standard
>>>>>>>>>>>> library call.
>>>>>>>>>>>>
>>>>>>>>>>>>> This is what I see, with some variations: https://godbolt.org/z/PxsEWs
>>>>>>>>>>>>> If this is missing in the backend, is there a plan to get this working,
>>>>>>>>>>>>> I'd really like to have the
>>>>>>>>>>>>> intrinsics in the middle end rather than __nv_cos, not to mention that
>>>>>>>>>>>>> -ffast-math does emit intrinsics
>>>>>>>>>>>>> and crashes.
>>>>>>>>>>>>>
>>>>>>>>>>>> It all boils down to the fact that PTX does not have the standard
>>>>>>>>>>>> libc/libm which LLVM could lower the calls to, nor does it have a 'linking'
>>>>>>>>>>>> phase where we could link such a library in, if we had it.
>>>>>>>>>>>>
>>>>>>>>>>>> Libdevice bitcode does provide the implementations for some of the
>>>>>>>>>>>> functions (though with a __nv_ prefix) and clang links it in in order to
>>>>>>>>>>>> avoid generating IR that LLVM can't handle, but that's a workaround that
>>>>>>>>>>>> does not help LLVM itself.
>>>>>>>>>>>>
>>>>>>>>>>>> --Artem
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> ~ Johannes
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> ───────────────────
>>>>>>>>>>>>> ∽ Johannes (he/his)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> --Artem Belevich
>>>>>>>>>>>>