William Moses via llvm-dev
2021-Mar-11 02:44 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
We could also consider doing something slightly broader.

For example we could define a special attribute on top of the llvm.cos
call/declaration etc. with metadata or an attribute that points to the
actual __nv_cos function. Then, in a subsequent lowering pass, the
corresponding intrinsic with the relevant attribute has its uses replaced
by the actual function.

On Wed, Mar 10, 2021 at 7:57 PM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
> [snip]
Johannes Doerfert via llvm-dev
2021-Mar-11 18:54 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
I certainly agree we should try to avoid a hard-coded mapping
in C++.
I could see something like:
```
__attribute__((implementation("llvm.cos")))
double __nv_cos(...) { ... }
```
and a pass that transforms calls to the intrinsic named in an
"implementation" attribute into calls to the function that carries it.
Maybe later we attach a score/priority ;)
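A rough sketch of such a pass is below. It is illustrative only: the
"implementation" string attribute and the pass itself are hypothetical, and
an overloaded intrinsic like llvm.cos really appears under a suffixed name
(e.g. llvm.cos.f64), which a real pass would have to map back.

```
// Hypothetical sketch (not an existing LLVM pass): redirect calls to an
// intrinsic to the function carrying an "implementation" attribute that
// names that intrinsic, e.g. __nv_cos carrying implementation("llvm.cos").
#include "llvm/ADT/StringMap.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/InstrTypes.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

struct ImplementationLoweringPass
    : PassInfoMixin<ImplementationLoweringPass> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    // Collect "intrinsic name -> providing function" from the attribute.
    StringMap<Function *> Impls;
    for (Function &F : M)
      if (F.hasFnAttribute("implementation"))
        Impls[F.getFnAttribute("implementation").getValueAsString()] = &F;

    bool Changed = false;
    for (Function &F : M)
      for (Instruction &I : instructions(F))
        if (auto *CB = dyn_cast<CallBase>(&I))
          if (Function *Callee = CB->getCalledFunction())
            if (Callee->isIntrinsic()) {
              // A real pass would map llvm.cos.f64 etc. back to "llvm.cos".
              auto It = Impls.find(Callee->getName());
              if (It != Impls.end()) {
                CB->setCalledFunction(It->second);
                Changed = true;
              }
            }
    return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
  }
};
```

The nice property is that the mapping lives next to the __nv_XXX definitions
(in a header or the bitcode library) instead of being hard-coded in C++.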
On 3/10/21 8:44 PM, William Moses wrote:
> We could also consider doing something slightly broader.
>
> For example we could define a special attribute on top of the llvm.cos
> call/declaration etc with metadata or an attribute that points to the
> actual __nv_cos function. Then in a subsequent lowering pass the
> corresponding intrinsic with the relevant attribute has its uses replaced
> by the actual function.
>
>
> On Wed, Mar 10, 2021 at 7:57 PM Johannes Doerfert <
> johannesdoerfert at gmail.com> wrote:
>
>> On 3/10/21 6:22 PM, Artem Belevich wrote:
>>> On Wed, Mar 10, 2021 at 3:44 PM Johannes Doerfert <
>>> johannesdoerfert at gmail.com> wrote:
>>>
>>>> On 3/10/21 4:38 PM, Artem Belevich wrote:
>>>>> On Wed, Mar 10, 2021 at 1:55 PM Johannes Doerfert <
>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>
>>>>>> On 3/10/21 3:25 PM, Artem Belevich wrote:
>>>>>>> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert
<
>>>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>>>
>>>>>>>> Right. We could keep the definition of __nv_cos
and friends
>>>>>>>> around. Right now, -ffast-math might just crash
on the user,
>>>>>>>> which is arguably a bad thing. I can also see
us benefiting
>>>>>>>> in various other ways from llvm.cos uses
instead of __nv_cos
>>>>>>>> (assuming precision is according to the user
requirements but
>>>>>>>> that is always a condition).
>>>>>>>>
>>>>>>>> It could be as simple as introducing __nv_cos
into
>>>>>>>> "llvm.used" and a backend
matching/rewrite pass.
>>>>>>>>
>>>>>>>> If the backend knew the libdevice location it
could even pick
>>>>>>>> the definitions from there. Maybe we could link
libdevice late
>>>>>>>> instead of eager?
>>>>>>>>
>>>>>>> It's possible, but it would require plumbing in
CUDA SDK awareness
>> into
>>>>>>> LLVM. While clang driver can deal with that, LLVM
currently can't.
>> The
>>>>>>> bitcode library path would have to be provided by
the user.
>>>>>> The PTX backend could arguably be CUDA SDK aware, IMHO,
it would
>>>>>> even be fine if the middle-end does the remapping to
get inlining
>>>>>> and folding benefits also after __nv_cos is used. See
below.
>>>>>>
>>>>>>
>>>>>>> The standard library as bitcode raises some
questions.
>>>>>> Which standard library? CUDAs libdevice is a bitcode
library, right?
>>>>>>
>>>>> It's whatever LLVM will need to lower libcalls to.
libdevice bitcode is
>>>> the
>>>>> closest approximation of that we have at the moment.
>>>>>
>>>>>
>>>>>>> * When do we want to do the linking? If we do it at
the beginning,
>> then
>>>>>> the
>>>>>>> question is how to make sure unused functions are
not eliminated
>> before
>>>>>> we
>>>>>>> may need them, as we don't know apriori
what's going to be needed. We
>>>>>> also
>>>>>>> do want the unused functions to be gone after
we're done. Linking it
>> in
>>>>>>> early would allow optimizing the code better at the
expense of having
>>>> to
>>>>>>> optimize a lot of code we'll throw away.
Linking it in late has less
>>>>>>> overhead, but leaves the linked in bitcode
unoptimized, though it's
>>>>>>> probably in the ballpark of what would happen with
a real library
>> call.
>>>>>>> I.e. no inlining, etc.
>>>>>>>
>>>>>>> * It incorporates linking into LLVM, which is not
LLVM's job.
>> Arguably,
>>>>>> the
>>>>>>> line should be drawn at the lowering to libcalls as
it's done for
>> other
>>>>>>> back-ends. However, we're also constrained by the need to have the
>>>>>>> linking done before we generate PTX which prevents
doing it after
>> LLVM
>>>> is
>>>>>>> done generating an object file.
>>>>>> I'm confused. Clang links in libdevice.bc early.
>>>>> Yes. Because that's where it has to happen if we want
to keep LLVM
>>>> unaware
>>>>> of CUDA SDK.
>>>>> It does not have to be the case if/when LLVM can do the
linking itself.
>>>>>
>>>>>
>>>>>> If we make sure
>>>>>> `__nv_cos` is not deleted early, we can at any point
"lower"
>> `llvm.cos`
>>>>>> to `__nv_cos` which is available. After the lowering we
can remove
>>>>>> the artificial uses of `__nv_XXX` functions that we
used to keep the
>>>>>> definitions around in order to remove them from the
final result.
>>>>>>
>>>>> This is the 'link early' approach, I should've
been explicit that it's
>>>>> 'link early *everything*' as opposed to linking
only what's needed at
>> the
>>>>> beginning.
>>>>> It would work at the expense of having to process/optimize
500KB worth
>> of
>>>>> bitcode for every compilation, whether it needs it or not.
>>>>>
>>>>>
>>>>>> We get the benefit of having `llvm.cos` for some of the
pipeline,
>>>>>> we know it does not have all the bad effects while
`__nv_cos` is
>> defined
>>>>>> with inline assembly. We also get the benefit of
inlining `__nv_cos`
>>>>>> and folding the implementation based on the arguments.
Finally,
>>>>>> this should work with the existing pipeline, the
linking is the same
>>>>>> as before, all we do is to keep the definitions alive
longer and
>>>>>> lower `llvm.cos` to `__nv_cos` in a middle end pass.
>>>>>>
>>>>> Again, I agree that it is doable.
>>>>>
>>>>>
>>>>>
>>>>>> This might be similar to the PTX solution you describe
below but I
>> feel
>>>>>> we get the inline benefit from this without actually
changing the
>>>> pipeline
>>>>>> at all.
>>>>>>
>>>>> So, to summarize:
>>>>> * link the library as bitcode early, add artificial placeholders for
>>>>> everything, compile, remove placeholders and DCE unused stuff away.
>>>>> Pros:
>>>>> - we're already doing most of it before clang hands off IR to
>>>>> LLVM, so it just pushes it a bit lower in the compilation.
>>>>> Cons:
>>>>> - runtime cost of optimizing libdevice bitcode,
>>>>> - libdevice may be required for all NVPTX compilations?
>>>>>
>>>>> * link the library as bitcode late.
>>>>> Pros:
>>>>> - lower runtime cost than link-early approach.
>>>>> Cons:
>>>>> - We'll need to make sure that NVVMReflect pass processes the library.
>>>>> - fewer optimizations on the library functions. Some of the code gets
>>>>> DCE'ed away after NVVMReflect and the rest could be optimized better.
>>>>> - libdevice may be required for all NVPTX compilations?
>>>>> * 'link' with the library as PTX appended as text to LLVM's output and let
>>>>> ptxas do the 'linking'
>>>>> Pros: LLVM remains agnostic of CUDA SDK installation details. All it
>>>>> does is allow lowering libcalls and leave their resolution to the
>>>>> external tools.
>>>>> Cons: Need to have the PTX library somewhere and need to integrate the
>>>>> 'linking' into the compilation process somehow.
>>>>>
>>>>> Neither is particularly good. If the runtime overhead of link-early is
>>>>> acceptable, then it may be a winner here, by a very small margin.
>>>>> link-as-PTX may be better conceptually as it keeps linking and compilation
>>>>> separate.
>>>>>
>>>>> As for the practical steps, here's what we need:
>>>>> - allow libcall lowering in NVPTX, possibly guarded by a flag. This is
>>>>> needed for all of the approaches above.
>>>>> - teach LLVM how to link in bitcode (and, possibly, control early/late
>>>>> mode)
>>>>> - teach clang driver to delegate libdevice linking to LLVM.
>>>>>
>>>>> This will allow us to experiment with all three approaches and see what
>>>>> works best.
>>>> I think if we embed knowledge about the nv_XXX functions we can
>>>> even get away without the cons you listed for early linking above.
>>>>
>>> WDYM by `embed knowledge about the nv_XXX functions`? By linking those
>>> functions in? Or do you mean that we should just declare them
>>> before/instead of linking libdevice in?
>> I mean by providing the "libcall lowering" pass. So the knowledge
>> that llvm.cos maps to __nv_cos.
>>
>>>
>>>> For early link I'm assuming an order similar to [0] but I also discuss
>>>> the case where we don't link libdevice early for a TU.
>>>>
>>> That link just describes the steps needed to use libdevice. It does not
>>> deal with how/where it fits in the LLVM pipeline.
>>> The gist is that NVVMReflect replaces some conditionals with constants.
>>> libdevice uses that as a poor man's IR preprocessor, conditionally enabling
>>> different implementations and relying on DCE and constant folding to remove
>>> unused parts and eliminate the now useless branches.
>>> While running NVVMReflect alone will make libdevice code valid and usable, it
>>> would still benefit from further optimizations. I do not know to what
>>> degree, though.
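To illustrate the idiom: `__nvvm_reflect` is the query that the NVVMReflect
pass folds to a constant, while the function and its body below are made-up
placeholders:

```
// Illustrative only: the shape of a libdevice-style function selecting
// between two variants via __nvvm_reflect; the bodies are placeholders.
extern "C" int __nvvm_reflect(const char *);

extern "C" double __nv_example(double x) {
  if (__nvvm_reflect("__CUDA_FTZ"))
    return x; // placeholder: flush-to-zero variant
  return x;   // placeholder: IEEE-conformant variant
}
// NVVMReflect rewrites the __nvvm_reflect call to 0 or 1; constant folding
// and DCE then strip the branch that was not selected.
```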
>>>
>>>
>>>> Link early:
>>>> 1) clang emits module.bc and links in libdevice.bc but with the
>>>>     `optnone`, `noinline`, and "used" attribute for functions in
>>>>     libdevice. ("used" is not an attribute but could as well be.)
>>>>     At this stage module.bc might call __nv_XXX or llvm.XXX freely
>>>>     as defined by -ffast-math and friends.
>>>>
>>> That could work. Just carrying extra IR around would probably be OK.
>>> We may want to do NVVMReflect as soon as we have it linked in and, maybe,
>>> allow optimizing the functions that are explicitly used already.
>> Right. NVVMReflect can be run twice and with `alwaysinline`
>> on the call sites of __nv_XXX functions we will actually
>> inline and optimize them while the definitions are just "dragged
>> along" in case we need them later.
>>
>>
>>>> 2) Run some optimizations in the middle end, maybe till the end of
>>>>     the inliner loop, unsure.
>>>> 3) Run a libcall lowering pass and another NVVMReflect pass (or the
>>>>     only instance thereof). We effectively remove all llvm.XXX calls
>>>>     in favor of __nv_XXX now. Note that we haven't spent (much) time
>>>>     on the libdevice code as it is optnone and most passes are good
>>>>     at skipping those. To me, it's unclear if the used parts should
>>>>     not be optimized before we inline them anyway to avoid redoing
>>>>     the optimizations over and over (per call site). That needs
>>>>     measuring I guess. Also note that we can still retain the current
>>>>     behavior for direct calls to __nv_XXX if we mark the call sites
>>>>     as `alwaysinline`, or at least the behavior is almost like the
>>>>     current one is.
>>>> 4) Run an always inliner pass on the __nv_XXX calls because it is
>>>>     something we would do right now. Alternatively, remove `optnone`
>>>>     and `noinline` from the __nv_XXX calls.
>>>> 5) Continue with the pipeline as before.
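A minimal sketch of the attribute-stripping alternative in step 4 (the pass
and its name are made up; it assumes the link-early scheme above where the
libdevice definitions were marked `optnone`/`noinline`):

```
// Illustrative sketch for step 4: once llvm.* calls have been rewritten to
// __nv_* calls, drop the attributes that kept the libdevice definitions
// dormant so the inliner and GlobalDCE can process them as usual.
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

struct ActivateLibdevicePass : PassInfoMixin<ActivateLibdevicePass> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    bool Changed = false;
    for (Function &F : M) {
      if (F.isDeclaration() || !F.getName().startswith("__nv_"))
        continue;
      F.removeFnAttr(Attribute::OptimizeNone);
      F.removeFnAttr(Attribute::NoInline);
      Changed = true;
    }
    return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
  }
};
```

The artificial uses that kept the definitions alive would be dropped at the
same point so that globals DCE can remove whatever ends up unreferenced.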
>>>>
>>>>
>>> SGTM.
>>>
>>>
>>>> As mentioned above, `optnone` avoids spending time on the libdevice
>>>> until we "activate" it. At that point (globals) DCE can be scheduled
>>>> to remove all unused parts right away. I don't think this is (much)
>>>> more expensive than linking libdevice early right now.
>>>>
>>>> Link late, aka. translation units without libdevice:
>>>> 1) clang emits module.bc but does not link in libdevice.bc, it will be
>>>>     made available later. We still can mix __nv_XXX and llvm.XXX calls
>>>>     freely as above.
>>>> 2) Same as above.
>>>> 3) Same as above.
>>>> 4) Same as above but effectively a no-op, no __nv_XXX definitions are
>>>>     available.
>>>> 5) Same as above.
>>>>
>>>>
>>>> I might misunderstand something about the current pipeline but from [0]
>>>> and the experiments I ran locally it looks like the above should cover all
>>>> the cases. WDYT?
>>>>
>>>>
>>> The `optnone` trick may indeed remove much of the practical differences
>>> between the early/late approaches.
>>> In principle it should work.
>>>
>>> Next question is -- is libdevice sufficient to satisfy LLVM's assumptions
>>> about the standard library.
>>> While it does provide most of the equivalents of libm functions, the set is
>>> not complete and some of the functions differ from their libm counterparts.
>>> The differences are minor, so we should be able to deal with it by
>>> generating a few wrapper functions for the odd cases.
>>> Here's what clang does to provide math functions using libdevice:
>>>
>>
https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h
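The wrappers in that header are essentially thin forwarding functions into
libdevice; a simplified sketch of the style (not a verbatim copy of the
header, which covers many more functions and uses a __DEVICE__ macro):

```
// Simplified illustration of the forwarding style used by
// __clang_cuda_math.h; declarations and qualifiers spelled out by hand.
extern "C" __device__ double __nv_cos(double);  // provided by libdevice
extern "C" __device__ float __nv_cosf(float);

static __device__ double cos(double __x) { return __nv_cos(__x); }
static __device__ float cosf(float __x) { return __nv_cosf(__x); }
```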
>>
>> Right now, clang will generate any llvm intrinsic and we crash, so anything
>> else is probably a step in the right direction. Eventually, we should "lower"
>> all intrinsics that the NVPTX backend can't handle or at least emit a nice
>> error message. Preferably, clang would know what we can't deal with and not
>> generate intrinsic calls for those in the first place.
>>
>>
>>> The most concerning aspect of libdevice is that we don't know when we'll no
>>> longer be able to use the libdevice bitcode. My understanding is that IR
>>> does not guarantee binary stability and at some point we may just be unable
>>> to use it. Ideally we need our own libm for GPUs.
>> For OpenMP I did my best to avoid writing libm (code) for GPUs by
>> piggybacking on CUDA and libc++ implementations; I hope it will stay that way.
>> That said, if the need arises we might really have to port libc++ to the
>> GPUs.
>>
>> Back to the problem with libdevice. I agree that the solution of NVIDIA
>> to ship a .bc library is suboptimal but with the existing, or an extended,
>> auto-upgrader we might be able to make that work reasonably well for the
>> foreseeable future. That problem is orthogonal to what we are discussing
>> above, I think.
>>
>> ~ Johannes
>>
>>
>>> --Artem
>>>
>>>
>>>> ~ Johannes
>>>>
>>>>
>>>> P.S. If the rewrite capability (aka libcall lowering) is generic we could
>>>> use the scheme for many other things as well.
>>>>
>>>>
>>>> [0] https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice
>>>>
>>>>
>>>>> --Artem
>>>>>
>>>>>
>>>>>> ~ Johannes
>>>>>>
>>>>>>
>>>>>>> One thing that may work within the existing
compilation model is to
>>>>>>> pre-compile the standard library into PTX and then
textually embed
>>>>>> relevant
>>>>>>> functions into the generated PTX, thus pushing the
'linking' phase
>> past
>>>>>> the
>>>>>>> end of LLVM's compilation and make it look
closer to the standard
>>>>>>> compile/link process. This way we'd only enable
libcall lowering in
>>>>>> NVPTX,
>>>>>>> assuming that the library functions will be
magically available out
>>>>>> there.
>>>>>>> Injection of PTX could be done with an external
script outside of
>> LLVM
>>>>>> and
>>>>>>> it could be incorporated into clang driver. Bonus
points for the fact
>>>>>> that
>>>>>>> this scheme is compatible with -fgpu-rdc out of the
box -- assemble
>> the
>>>>>> PTX
>>>>>>> with `ptxas -rdc` and then actually link with the
library, instead of
>>>>>>> injecting its PTX before invoking ptxas.
>>>>>>>
>>>>>>> --Artem
>>>>>>>
>>>>>>> Trying to figure out a good way to have the cake
and eat it too.
>>>>>>>> ~ Johannes
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/10/21 2:49 PM, William Moses wrote:
>>>>>>>>> Since clang (and arguably any other frontend that uses it) should link
>>>>>>>>> in libdevice, could we lower these intrinsics to the libdevice code?
>>>>>>> The linking happens *before* LLVM gets to work on
IR.
>>>>>>> As I said, it's a workaround, not the solution.
It's possible for
>> LLVM
>>>> to
>>>>>>> still attempt lowering something in the IR into a
libcall and we
>> would
>>>>>> not
>>>>>>> be able to deal with that. It happens to work well
enough in
>> practice.
>>>>>>> Do you have an example where you see the problem
with -ffast-math?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>> For example, consider compiling the simple
device function below:
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> //
/mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S -emit-llvm
>>>>>>>>> --cuda-path=/usr/local/cuda-11.0
-L/usr/local/cuda-11.0/lib64
>>>>>>>>> --cuda-gpu-arch=sm_37
>>>>>>>>> __device__ double f(double x) {
>>>>>>>>> return cos(x);
>>>>>>>>> }
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> The LLVM module for it is as follows:
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> ...
>>>>>>>>> define dso_local double @_Z1fd(double %x)
#0 {
>>>>>>>>> entry:
>>>>>>>>> %__a.addr.i = alloca double, align 8
>>>>>>>>> %x.addr = alloca double, align 8
>>>>>>>>> store double %x, double* %x.addr,
align 8
>>>>>>>>> %0 = load double, double* %x.addr,
align 8
>>>>>>>>> store double %0, double*
%__a.addr.i, align 8
>>>>>>>>> %1 = load double, double*
%__a.addr.i, align 8
>>>>>>>>> %call.i = call contract double
@__nv_cos(double %1) #7
>>>>>>>>> ret double %call.i
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> define internal double @__nv_cos(double %a)
#1 {
>>>>>>>>> %q.i = alloca i32, align 4
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> Obviously we would need to do something to
ensure these functions
>>>> don't
>>>>>>>> get
>>>>>>>>> deleted prior to their use in lowering from
intrinsic to libdevice.
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 10, 2021 at 3:39 PM Artem
Belevich <tra at google.com>
>>>> wrote:
>>>>>>>>>> On Wed, Mar 10, 2021 at 11:41 AM
Johannes Doerfert <
>>>>>>>>>> johannesdoerfert at gmail.com>
wrote:
>>>>>>>>>>
>>>>>>>>>>> Artem, Justin,
>>>>>>>>>>>
>>>>>>>>>>> I am running into a problem and
I'm curious if I'm missing
>>>> something
>>>>>> or
>>>>>>>>>>> if the support is simply missing.
>>>>>>>>>>> Am I correct to assume the NVPTX
backend does not deal with
>>>>>> `llvm.sin`
>>>>>>>>>>> and friends?
>>>>>>>>>>>
>>>>>>>>>> Correct. It can't deal with
anything that may need to lower to a
>>>>>>>> standard
>>>>>>>>>> library call.
>>>>>>>>>>
>>>>>>>>>>> This is what I see, with some
variations:
>>>>>> https://godbolt.org/z/PxsEWs
>>>>>>>>>>> If this is missing in the backend,
is there a plan to get this
>>>>>> working,
>>>>>>>>>>> I'd really like to have the
>>>>>>>>>>> intrinsics in the middle end rather
than __nv_cos, not to mention
>>>>>> that
>>>>>>>>>>> -ffast-math does emit intrinsics
>>>>>>>>>>> and crashes.
>>>>>>>>>>>
>>>>>>>>>> It all boils down to the fact that PTX
does not have the standard
>>>>>>>>>> libc/libm which LLVM could lower the
calls to, nor does it have a
>>>>>>>> 'linking'
>>>>>>>>>> phase where we could link such a
library in, if we had it.
>>>>>>>>>>
>>>>>>>>>> Libdevice bitcode does provide the
implementations for some of the
>>>>>>>>>> functions (though with a __nv_ prefix)
and clang links it in in
>>>> order
>>>>>> to
>>>>>>>>>> avoid generating IR that LLVM can't
handle, but that's a
>> workaround
>>>>>> that
>>>>>>>>>> does not help LLVM itself.
>>>>>>>>>>
>>>>>>>>>> --Artem
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> ~ Johannes
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> ───────────────────
>>>>>>>>>>> ∽ Johannes (he/his)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> --Artem Belevich
>>>>>>>>>>