Artem Belevich via llvm-dev
2021-Mar-12 01:59 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
On Thu, Mar 11, 2021 at 4:10 PM Johannes Doerfert < johannesdoerfert at gmail.com> wrote:> > On 3/11/21 1:37 PM, Artem Belevich wrote: > > On Thu, Mar 11, 2021 at 10:54 AM Johannes Doerfert < > > johannesdoerfert at gmail.com> wrote: > > > >> I certainly agree we should try to avoid a hard-coded mapping > >> in C++. > >> > >> I could see something like: > >> > >> ``` > >> __attribute__((implementation("llvm.cos")) > >> double __nv_cos(...) { ... } > >> > >> ``` > >> > >> and a pass that transforms all calls to a function with an > >> "implementation" to calls to that implementation. Maybe > >> later we attach a score/priority ;) > >> > >> > > I'm not sure how that would work. > > Where would you place that `__attribute__((implementation))` ? We do not > > have the definitions for `__nv_*` as they come from NVIDIA-provided > > bitcode. We could add the attribute to the declaration in > > `__clang_cuda_libdevice_declares.h`. > > How does LLVM handle the differences in function attributes between > > function declaration and definition? Will there be trouble when we link > in > > the actual __nv_cos from the libdevice that would not have that > attribute? > > > > Another potential gotcha is that for the functions that can't be directly > > mapped 1:1 to `__nv_*` counterparts, we'd still need to provide the > > implementation ourselves. We will not know whether the implementation > will > > be used until after the substitution pass, so we'll need to make sure > it's > > not DCE'd until then. It appears to be the same issue (though on a > smaller > > scale) as with linking in libdevice directly. > > > > Let's take a step back and figure out what are the issues we want to > solve. > > > > The top-level goal is to provide implementation for LLVM intrinsics. For > > now let's stick with libm-related ones. > > What we have is the libdevice bitcode which uses different function names > > and provides a subset of the functionality we need. > > What we miss is > > - something to connect LLVM's libcalls to the GPU-side implementation, > > - additional code to provide implementations for the functions that > are > > missing or different in libdevice. > > > > Considering that we want this to work in LLVM, the additional code would > > have to be a bitcode and it would have to exist in addition to libdevice. > > Our options for the mapping between LLVM intrinsics and the > implementation > > are > > * intrinsic -> __nv_* equivalent mapping pass > > This would still need additional bitcode for the missing/different > > functions. > > * lower libcalls to the standard libm APIs, implement libm -> __nv_* > > mapping in our own bitcode. > > > > Considering that additional bitcode is needed in both cases, I believe > that > > the second approach makes more sense. > > I really hope to avoid any additional bitcode, there are too many > drawbacks and basically no benefits, IMHO.

Could you elaborate on the drawbacks?

The fact is that we already depend on the external bitcode (libdevice in this case), though right now we're trying to keep that to clang only. The current approach is not sound in principle and is rather brittle in practice. Nor is clang the only source of IR for LLVM to compile, so it leaves LLVM-only users without a good solution. There are already a handful of JIT compilers that each do their own gluing of libdevice into the IR they want to compile for NVPTX. I think we do have a very good reason to deal with that in LLVM itself.
While I agree that additional bitcode is a hassle, I think it would be a net positive change for LLVM usability for NVPTX users. The external bitcode would not be required for those who do not need libdevice now, so the change should not be disruptive.> > > LLVM does not need to know or care about what's provided by libdevice, > and > > we'd have more flexibility, compared to what we could do in the mapping > > pass. It also makes it easy to substitute a different implementation, if > we > > have or need one. > > I agree that LLVM (core) should not know about __nv_*, that's why I > suggested > the `__attribute__((implements("...")))` approach. My preferred solution > is still to annotate our declarations of __nv_* and point to the > llvm.intrinsics (name) from there. If we have a missing mapping, we > point to an > intrinsic from a definition that lives in the Clang headers next to the > __nv_* declarations. >We may have slightly different end goals in mind. I was thinking of making the solution work for LLVM. I.e. users would be free to use llvm.sin with NVPTX back-end with a few documented steps needed to make it work (basically "pass additional -link-libm-bitcode=path/to/bitcode_libm.bc"). Your scenario above suggests that the goal is to allow clang to generate both llvm intrinsics and the glue which would then be used by LLVM to make it work for clang, but not in general. It's an improvement compared to what we have now, but I still think we should try a more general solution.> > This does not yet work because -mlink-builtin-bitcode (which I assume > triggers the llvm-link logic) will drop the attributes of a declaration > if a definition is found. I think that should not be the case anyway > such that the union of attributes is set. > > The benefit I see for the above is that the mapping is tied to the > declarations and doesn't live in a tablegen file far away. It works well > even if we can't map 1:1, and we could even restrict the "used" attribute > to anything that has an "implements" attribute.I do not think we need tablegen for anything here. I was thinking of just compiling a real math library (or a wrapper on top of libdevice) from C/C++ sources. Our approaches are not mutually exclusive. If there's a strong opposition to providing a bitcode libm for NVPTX, implementing it somewhere closer to clang would still be an improvement, even if it's not as general as I'd like. It should still be possible to allow LLVM to lower libcalls in NVPTX to standard libm API, enabled with a flag, and just let the end users who are interested (e.g. JITs) to provide their own implementation. --Artem> So: > > ``` > __nv_A() { ... } // called, inlined and optimized as before, DCE'ed after. > > __nv_B() { ... } // not called, DCE'ed. > > __attribute__((implements("llvm.C")) > __nv_C() { ... } // calls are inlined and optimized as before, not DCE'ed > // though because of the attribute. Replaces llvm.C as > // callee in the special pass. > ``` > > So "implements" gives you a way to statically replace a function > declaration > or definition with another one. I could see it being used to provide other > intrinsics to platforms with backends that don't support them. > > Does that make some sense? > > ~ Johannes > > > > > > WDYT? > > > > --Artem > > > > > >> > >> On 3/10/21 8:44 PM, William Moses wrote: > >>> We could also consider doing something slightly broader. 
> >>> > >>> For example we could define a special attribute on top of the llvm.cos > >>> call/declaration etc with metadata or an attribute that points to the > >>> actual __nv_cos function. Then in a subsequent lowering pass the > >>> corresponding intrinsic with the relevant attribute has its uses > replaced > >>> by the actual function. > >>> > >>> > >>> On Wed, Mar 10, 2021 at 7:57 PM Johannes Doerfert < > >>> johannesdoerfert at gmail.com> wrote: > >>> > >>>> On 3/10/21 6:22 PM, Artem Belevich wrote: > >>>>> On Wed, Mar 10, 2021 at 3:44 PM Johannes Doerfert < > >>>>> johannesdoerfert at gmail.com> wrote: > >>>>> > >>>>>> On 3/10/21 4:38 PM, Artem Belevich wrote: > >>>>>>> On Wed, Mar 10, 2021 at 1:55 PM Johannes Doerfert < > >>>>>>> johannesdoerfert at gmail.com> wrote: > >>>>>>> > >>>>>>>> On 3/10/21 3:25 PM, Artem Belevich wrote: > >>>>>>>>> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert < > >>>>>>>>> johannesdoerfert at gmail.com> wrote: > >>>>>>>>> > >>>>>>>>>> Right. We could keep the definition of __nv_cos and friends > >>>>>>>>>> around. Right now, -ffast-math might just crash on the user, > >>>>>>>>>> which is arguably a bad thing. I can also see us benefiting > >>>>>>>>>> in various other ways from llvm.cos uses instead of __nv_cos > >>>>>>>>>> (assuming precision is according to the user requirements but > >>>>>>>>>> that is always a condition). > >>>>>>>>>> > >>>>>>>>>> It could be as simple as introducing __nv_cos into > >>>>>>>>>> "llvm.used" and a backend matching/rewrite pass. > >>>>>>>>>> > >>>>>>>>>> If the backend knew the libdevice location it could even pick > >>>>>>>>>> the definitions from there. Maybe we could link libdevice late > >>>>>>>>>> instead of eager? > >>>>>>>>>> > >>>>>>>>> It's possible, but it would require plumbing in CUDA SDK > awareness > >>>> into > >>>>>>>>> LLVM. While clang driver can deal with that, LLVM currently > can't. > >>>> The > >>>>>>>>> bitcode library path would have to be provided by the user. > >>>>>>>> The PTX backend could arguably be CUDA SDK aware, IMHO, it would > >>>>>>>> even be fine if the middle-end does the remapping to get inlining > >>>>>>>> and folding benefits also after __nv_cos is used. See below. > >>>>>>>> > >>>>>>>> > >>>>>>>>> The standard library as bitcode raises some questions. > >>>>>>>> Which standard library? CUDAs libdevice is a bitcode library, > right? > >>>>>>>> > >>>>>>> It's whatever LLVM will need to lower libcalls to. libdevice > bitcode > >> is > >>>>>> the > >>>>>>> closest approximation of that we have at the moment. > >>>>>>> > >>>>>>> > >>>>>>>>> * When do we want to do the linking? If we do it at the > beginning, > >>>> then > >>>>>>>> the > >>>>>>>>> question is how to make sure unused functions are not eliminated > >>>> before > >>>>>>>> we > >>>>>>>>> may need them, as we don't know apriori what's going to be > needed. > >> We > >>>>>>>> also > >>>>>>>>> do want the unused functions to be gone after we're done. Linking > >> it > >>>> in > >>>>>>>>> early would allow optimizing the code better at the expense of > >> having > >>>>>> to > >>>>>>>>> optimize a lot of code we'll throw away. Linking it in late has > >> less > >>>>>>>>> overhead, but leaves the linked in bitcode unoptimized, though > it's > >>>>>>>>> probably in the ballpark of what would happen with a real library > >>>> call. > >>>>>>>>> I.e. no inlining, etc. > >>>>>>>>> > >>>>>>>>> * It incorporates linking into LLVM, which is not LLVM's job. 
> >>>> Arguably, > >>>>>>>> the > >>>>>>>>> line should be drawn at the lowering to libcalls as it's done for > >>>> other > >>>>>>>>> back-ends. However, we're also constrained to by the need to have > >> the > >>>>>>>>> linking done before we generate PTX which prevents doing it after > >>>> LLVM > >>>>>> is > >>>>>>>>> done generating an object file. > >>>>>>>> I'm confused. Clang links in libdevice.bc early. > >>>>>>> Yes. Because that's where it has to happen if we want to keep LLVM > >>>>>> unaware > >>>>>>> of CUDA SDK. > >>>>>>> It does not have to be the case if/when LLVM can do the linking > >> itself. > >>>>>>> > >>>>>>>> If we make sure > >>>>>>>> `__nv_cos` is not deleted early, we can at any point "lower" > >>>> `llvm.cos` > >>>>>>>> to `__nv_cos` which is available. After the lowering we can remove > >>>>>>>> the artificial uses of `__nv_XXX` functions that we used to keep > the > >>>>>>>> definitions around in order to remove them from the final result. > >>>>>>>> > >>>>>>> This is the 'link early' approach, I should've been explicit that > >> it's > >>>>>>> 'link early *everything*' as opposed to linking only what's needed > at > >>>> the > >>>>>>> beginning. > >>>>>>> It would work at the expense of having to process/optimize 500KB > >> worth > >>>> of > >>>>>>> bitcode for every compilation, whether it needs it or not. > >>>>>>> > >>>>>>> > >>>>>>>> We get the benefit of having `llvm.cos` for some of the pipeline, > >>>>>>>> we know it does not have all the bad effects while `__nv_cos` is > >>>> defined > >>>>>>>> with inline assembly. We also get the benefit of inlining > `__nv_cos` > >>>>>>>> and folding the implementation based on the arguments. Finally, > >>>>>>>> this should work with the existing pipeline, the linking is the > same > >>>>>>>> as before, all we do is to keep the definitions alive longer and > >>>>>>>> lower `llvm.cos` to `__nv_cos` in a middle end pass. > >>>>>>>> > >>>>>>> Again, I agree that it is doable. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>> This might be similar to the PTX solution you describe below but I > >>>> feel > >>>>>>>> we get the inline benefit from this without actually changing the > >>>>>> pipeline > >>>>>>>> at all. > >>>>>>>> > >>>>>>> So, to summarize: > >>>>>>> * link the library as bitcode early, add artificial placeholders > for > >>>>>>> everything, compile, remove placeholders and DCE unused stuff away. > >>>>>>> Pros: > >>>>>>> - we're already doing most of it before clang hands hands > off > >>>> IR to > >>>>>>> LLVM, so it just pushes it a bit lower in the compilation. > >>>>>>> Cons: > >>>>>>> - runtime cost of optimizing libdevice bitcode, > >>>>>>> - libdevice may be required for all NVPTX compilations? > >>>>>>> > >>>>>>> * link the library as bitcode late. > >>>>>>> Pros: > >>>>>>> - lower runtime cost than link-early approach. > >>>>>>> Cons: > >>>>>>> - We'll need to make sure that NVVMReflect pass processes > the > >>>>>> library. > >>>>>>> - less optimizations on the library functions. Some of the > >> code > >>>>>> gets > >>>>>>> DCE'ed away after NVVMReflect and the rest could be optimized > better. > >>>>>>> - libdevice may be required for all NVPTX compilations? > >>>>>>> * 'link' with the library as PTX appended as text to LLVM's output > >> and > >>>>>> let > >>>>>>> ptxas do the 'linking' > >>>>>>> Pros: LLVM remains agnostic of CUDA SDK installation > details. > >> All > >>>> it > >>>>>>> does is allows lowering libcalls and leaves their resolution to the > >>>>>>> external tools. 
> >>>>>>> Cons: Need to have the PTX library somewhere and need to > >> integrate > >>>> the > >>>>>>> 'linking' into the compilation process somehow. > >>>>>>> > >>>>>>> Neither is particularly good. If the runtime overhead of link-early > >> is > >>>>>>> acceptable, then it may be a winner here, by a very small margin. > >>>>>>> link-as-PTX may be better conceptually as it keeps linking and > >>>>>> compilation > >>>>>>> separate. > >>>>>>> > >>>>>>> As for the practical steps, here's what we need: > >>>>>>> - allow libcall lowering in NVPTX, possibly guarded by a flag. This > >> is > >>>>>>> needed for all of the approaches above. > >>>>>>> - teach LLVM how to link in bitcode (and, possibly, control > >> early/late > >>>>>> mode) > >>>>>>> - teach clang driver to delegate libdevice linking to LLVM. > >>>>>>> > >>>>>>> This will allow us to experiment with all three approaches and see > >> what > >>>>>>> works best. > >>>>>> I think if we embed knowledge about the nv_XXX functions we can > >>>>>> even get away without the cons you listed for early linking above. > >>>>>> > >>>>> WDYM by `embed knowledge about the nv_XXX functions`? By linking > those > >>>>> functions in? Of do you mean that we should just declare them > >>>>> before/instead of linking libdevice in? > >>>> I mean by providing the "libcall lowering" pass. So the knowledge > >>>> that llvm.cos maps to __nv_cos. > >>>> > >>>>>> For early link I'm assuming an order similar to [0] but I also > discuss > >>>>>> the case where we don't link libdevice early for a TU. > >>>>>> > >>>>> That link just describes the steps needed to use libdevice. It does > not > >>>>> deal with how/where it fits in the LLVM pipeline. > >>>>> The gist is that NVVMreflect replaces some conditionals with > constants. > >>>>> libdevice uses that as a poor man's IR preprocessor, conditionally > >>>> enabling > >>>>> different implementations and relying on DCE and constant folding to > >>>> remove > >>>>> unused parts and eliminate the now useless branches. > >>>>> While running NVVM alone will make libdevice code valid and usable, > it > >>>>> would still benefit from further optimizations. I do not know to what > >>>>> degree, though. > >>>>> > >>>>> > >>>>>> Link early: > >>>>>> 1) clang emits module.bc and links in libdevice.bc but with the > >>>>>> `optnone`, `noinline`, and "used" attribute for functions in > >>>>>> libdevice. ("used" is not an attribute but could as well be.) > >>>>>> At this stage module.bc might call __nv_XXX or llvm.XXX > freely > >>>>>> as defined by -ffast-math and friends. > >>>>>> > >>>>> That could work. Just carrying extra IR around would probably be OK. > >>>>> We may want to do NVVMReflect as soon as we have it linked in and, > >> maybe, > >>>>> allow optimizing the functions that are explicitly used already. > >>>> Right. NVVMReflect can be run twice and with `alwaysinline` > >>>> on the call sites of __nv_XXX functions we will actually > >>>> inline and optimize them while the definitions are just "dragged > >>>> along" in case we need them later. > >>>> > >>>> > >>>>>> 2) Run some optimizations in the middle end, maybe till the end of > >>>>>> the inliner loop, unsure. > >>>>>> 3) Run a libcall lowering pass and another NVVMReflect pass (or the > >>>>>> only instance thereof). We effectively remove all llvm.XXX > calls > >>>>> in favor of __nv_XXX now. Note that we haven't spend (much) > time > >>>>>> on the libdevice code as it is optnone and most passes are > good > >>>>>> at skipping those. 
To me, it's unclear if the used parts > should > >>>>>> not be optimized before we inline them anyway to avoid > redoing > >>>>>> the optimizations over and over (per call site). That needs > >>>>>> measuring I guess. Also note that we can still retain the > >> current > >>>>>> behavior for direct calls to __nv_XXX if we mark the call > sites > >>>>>> as `alwaysinline`, or at least the behavior is almost like > the > >>>>>> current one is. > >>>>>> 4) Run an always inliner pass on the __nv_XXX calls because it is > >>>>>> something we would do right now. Alternatively, remove > `optnone` > >>>>>> and `noinline` from the __nv_XXX calls. > >>>>>> 5) Continue with the pipeline as before. > >>>>>> > >>>>>> > >>>>> SGTM. > >>>>> > >>>>> > >>>>>> As mentioned above, `optnone` avoids spending time on the libdevice > >>>>>> until we "activate" it. At that point (globals) DCE can be scheduled > >>>>>> to remove all unused parts right away. I don't think this is (much) > >>>>>> more expensive than linking libdevice early right now. > >>>>>> > >>>>>> Link late, aka. translation units without libdevice: > >>>>>> 1) clang emits module.bc but does not link in libdevice.bc, it will > be > >>>>>> made available later. We still can mix __nv_XXX and llvm.XXX > >> calls > >>>>>> freely as above. > >>>>>> 2) Same as above. > >>>>>> 3) Same as above. > >>>>>> 4) Same as above but effectively a no-op, no __nv_XXX definitions > are > >>>>>> available. > >>>>>> 5) Same as above. > >>>>>> > >>>>>> > >>>>>> I might misunderstand something about the current pipeline but from > >> [0] > >>>>>> and the experiments I run locally it looks like the above should > cover > >>>> all > >>>>>> the cases. WDYT? > >>>>>> > >>>>>> > >>>>> The `optnone` trick may indeed remove much of the practical > differences > >>>>> between the early/late approaches. > >>>>> In principle it should work. > >>>>> > >>>>> Next question is -- is libdevice sufficient to satisfy LLVM's > >> assumptions > >>>>> about the standard library. > >>>>> While it does provide most of the equivalents of libm functions, the > >> set > >>>> is > >>>>> not complete and some of the functions differ from their libm > >>>> counterparts. > >>>>> The differences are minor, so we should be able to deal with it by > >>>>> generating few wrapper functions for the odd cases. > >>>>> Here's what clang does to provide math functions using libdevice: > >>>>> > >> > https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h > >>>> Right now, clang will generate any llvm intrinsic and we crash, so > >> anything > >>>> else is probably a step in the right direction. Eventually, we should > >>>> "lower" > >>>> all intrinsics that the NVPTX backend can't handle or at least emit a > >> nice > >>>> error message. Preferably, clang would know what we can't deal with > and > >> not > >>>> generate intinsic calls for those in the first place. > >>>> > >>>> > >>>>> The most concerning aspect of libdevice is that we don't know when > >> we'll > >>>> no > >>>>> longer be able to use the libdevice bitcode? My understanding is that > >> IR > >>>>> does not guarantee binary stability and at some point we may just be > >>>> unable > >>>>> to use it. Ideally we need our own libm for GPUs. > >>>> For OpenMP I did my best to avoid writing libm (code) for GPUs by > piggy > >>>> backing on CUDA and libc++ implementations, I hope it will stay that > >> way. > >>>> That said, if the need arises we might really have to port libc++ to > the > >>>> GPUs. 
> >>>> > >>>> Back to the problem with libdevice. I agree that the solution of > NVIDIA > >>>> to ship a .bc library is suboptimal but with the existing, or an > >> extended, > >>>> auto-upgrader we might be able to make that work reasonably well for > the > >>>> foreseeable future. That problem is orthogonal to what we are > discussing > >>>> above, I think. > >>>> > >>>> ~ Johannes > >>>> > >>>> > >>>>> --Artem > >>>>> > >>>>> > >>>>>> ~ Johannes > >>>>>> > >>>>>> > >>>>>> P.S. If the rewrite capability (aka libcall lowering) is generic we > >>>> could > >>>>>> use the scheme for many other things as well. > >>>>>> > >>>>>> > >>>>>> [0] https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice > >>>>>> > >>>>>> > >>>>>>> --Artem > >>>>>>> > >>>>>>> > >>>>>>>> ~ Johannes > >>>>>>>> > >>>>>>>> > >>>>>>>>> One thing that may work within the existing compilation model is > to > >>>>>>>>> pre-compile the standard library into PTX and then textually > embed > >>>>>>>> relevant > >>>>>>>>> functions into the generated PTX, thus pushing the 'linking' > phase > >>>> past > >>>>>>>> the > >>>>>>>>> end of LLVM's compilation and make it look closer to the standard > >>>>>>>>> compile/link process. This way we'd only enable libcall lowering > in > >>>>>>>> NVPTX, > >>>>>>>>> assuming that the library functions will be magically available > out > >>>>>>>> there. > >>>>>>>>> Injection of PTX could be done with an external script outside of > >>>> LLVM > >>>>>>>> and > >>>>>>>>> it could be incorporated into clang driver. Bonus points for the > >> fact > >>>>>>>> that > >>>>>>>>> this scheme is compatible with -fgpu-rdc out of the box -- > assemble > >>>> the > >>>>>>>> PTX > >>>>>>>>> with `ptxas -rdc` and then actually link with the library, > instead > >> of > >>>>>>>>> injecting its PTX before invoking ptxas. > >>>>>>>>> > >>>>>>>>> --Artem > >>>>>>>>> > >>>>>>>>> Trying to figure out a good way to have the cake and eat it too. > >>>>>>>>>> ~ Johannes > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On 3/10/21 2:49 PM, William Moses wrote: > >>>>>>>>>>> Since clang (and arguably any other frontend that uses) should > >> link > >>>>>> in > >>>>>>>>>>> libdevice, could we lower these intrinsics to the libdevice > code? > >>>>>>>>> The linking happens *before* LLVM gets to work on IR. > >>>>>>>>> As I said, it's a workaround, not the solution. It's possible for > >>>> LLVM > >>>>>> to > >>>>>>>>> still attempt lowering something in the IR into a libcall and we > >>>> would > >>>>>>>> not > >>>>>>>>> be able to deal with that. It happens to work well enough in > >>>> practice. > >>>>>>>>> Do you have an example where you see the problem with > -ffast-math? > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>>>> For example, consider compiling the simple device function > below: > >>>>>>>>>>> > >>>>>>>>>>> ``` > >>>>>>>>>>> // /mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S > >> -emit-llvm > >>>>>>>>>>> --cuda-path=/usr/local/cuda-11.0 > >> -L/usr/local/cuda-11.0/lib64 > >>>>>>>>>>> --cuda-gpu-arch=sm_37 > >>>>>>>>>>> __device__ double f(double x) { > >>>>>>>>>>> return cos(x); > >>>>>>>>>>> } > >>>>>>>>>>> ``` > >>>>>>>>>>> > >>>>>>>>>>> The LLVM module for it is as follows: > >>>>>>>>>>> > >>>>>>>>>>> ``` > >>>>>>>>>>> ... 
> >>>>>>>>>>> define dso_local double @_Z1fd(double %x) #0 { > >>>>>>>>>>> entry: > >>>>>>>>>>> %__a.addr.i = alloca double, align 8 > >>>>>>>>>>> %x.addr = alloca double, align 8 > >>>>>>>>>>> store double %x, double* %x.addr, align 8 > >>>>>>>>>>> %0 = load double, double* %x.addr, align 8 > >>>>>>>>>>> store double %0, double* %__a.addr.i, align 8 > >>>>>>>>>>> %1 = load double, double* %__a.addr.i, align 8 > >>>>>>>>>>> %call.i = call contract double @__nv_cos(double %1) #7 > >>>>>>>>>>> ret double %call.i > >>>>>>>>>>> } > >>>>>>>>>>> > >>>>>>>>>>> define internal double @__nv_cos(double %a) #1 { > >>>>>>>>>>> %q.i = alloca i32, align 4 > >>>>>>>>>>> ``` > >>>>>>>>>>> > >>>>>>>>>>> Obviously we would need to do something to ensure these > functions > >>>>>> don't > >>>>>>>>>> get > >>>>>>>>>>> deleted prior to their use in lowering from intrinsic to > >> libdevice. > >>>>>>>>>>> ... > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <tra at google.com > > > >>>>>> wrote: > >>>>>>>>>>>> On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert < > >>>>>>>>>>>> johannesdoerfert at gmail.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Artem, Justin, > >>>>>>>>>>>>> > >>>>>>>>>>>>> I am running into a problem and I'm curious if I'm missing > >>>>>> something > >>>>>>>> or > >>>>>>>>>>>>> if the support is simply missing. > >>>>>>>>>>>>> Am I correct to assume the NVPTX backend does not deal with > >>>>>>>> `llvm.sin` > >>>>>>>>>>>>> and friends? > >>>>>>>>>>>>> > >>>>>>>>>>>> Correct. It can't deal with anything that may need to lower > to a > >>>>>>>>>> standard > >>>>>>>>>>>> library call. > >>>>>>>>>>>> > >>>>>>>>>>>>> This is what I see, with some variations: > >>>>>>>> https://godbolt.org/z/PxsEWs > >>>>>>>>>>>>> If this is missing in the backend, is there a plan to get > this > >>>>>>>> working, > >>>>>>>>>>>>> I'd really like to have the > >>>>>>>>>>>>> intrinsics in the middle end rather than __nv_cos, not to > >> mention > >>>>>>>> that > >>>>>>>>>>>>> -ffast-math does emit intrinsics > >>>>>>>>>>>>> and crashes. > >>>>>>>>>>>>> > >>>>>>>>>>>> It all boils down to the fact that PTX does not have the > >> standard > >>>>>>>>>>>> libc/libm which LLVM could lower the calls to, nor does it > have > >> a > >>>>>>>>>> 'linking' > >>>>>>>>>>>> phase where we could link such a library in, if we had it. > >>>>>>>>>>>> > >>>>>>>>>>>> Libdevice bitcode does provide the implementations for some of > >> the > >>>>>>>>>>>> functions (though with a __nv_ prefix) and clang links it in > in > >>>>>> order > >>>>>>>> to > >>>>>>>>>>>> avoid generating IR that LLVM can't handle, but that's a > >>>> workaround > >>>>>>>> that > >>>>>>>>>>>> does not help LLVM itself. > >>>>>>>>>>>> > >>>>>>>>>>>> --Artem > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> ~ Johannes > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> -- > >>>>>>>>>>>>> ─────────────────── > >>>>>>>>>>>>> ∽ Johannes (he/his) > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> -- > >>>>>>>>>>>> --Artem Belevich > >>>>>>>>>>>> > > >-- --Artem Belevich
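To make the second option above more concrete (NVPTX lowering libcalls to the standard libm names, with a small bitcode library mapping libm onto libdevice), here is a minimal sketch. The file name, the function list, and the build command are illustrative assumptions, not an existing artifact, and `-link-libm-bitcode` is the hypothetical flag mentioned in this thread:

```cpp
// nvptx_libm_wrappers.cpp -- hypothetical glue, not an existing file.
// The idea: the NVPTX backend lowers llvm.cos/llvm.sin to ordinary libm
// calls (cos/sin), and a small bitcode library satisfies those calls by
// forwarding to libdevice. It could be built roughly like this
// (illustrative, untested flags):
//   clang++ --target=nvptx64-nvidia-cuda -march=sm_70 -O2 -emit-llvm -c \
//       nvptx_libm_wrappers.cpp -o bitcode_libm.bc
// and handed to the compiler via the hypothetical
//   -link-libm-bitcode=path/to/bitcode_libm.bc
// discussed in this thread, with libdevice itself linked in as usual.

// Declarations of the libdevice entry points; their definitions come from
// NVIDIA's libdevice.bc.
extern "C" double __nv_cos(double);
extern "C" double __nv_sin(double);
extern "C" float __nv_cosf(float);
extern "C" float __nv_sinf(float);

// Standard libm names the backend would emit libcalls to.
extern "C" double cos(double x) { return __nv_cos(x); }
extern "C" double sin(double x) { return __nv_sin(x); }
extern "C" float cosf(float x) { return __nv_cosf(x); }
extern "C" float sinf(float x) { return __nv_sinf(x); }
```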
Johannes Doerfert via llvm-dev
2021-Mar-12 04:26 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
On 3/11/21 7:59 PM, Artem Belevich wrote:> On Thu, Mar 11, 2021 at 4:10 PM Johannes Doerfert < > johannesdoerfert at gmail.com> wrote: > >> On 3/11/21 1:37 PM, Artem Belevich wrote: >>> On Thu, Mar 11, 2021 at 10:54 AM Johannes Doerfert < >>> johannesdoerfert at gmail.com> wrote: >>> >>>> I certainly agree we should try to avoid a hard-coded mapping >>>> in C++. >>>> >>>> I could see something like: >>>> >>>> ``` >>>> __attribute__((implementation("llvm.cos")) >>>> double __nv_cos(...) { ... } >>>> >>>> ``` >>>> >>>> and a pass that transforms all calls to a function with an >>>> "implementation" to calls to that implementation. Maybe >>>> later we attach a score/priority ;) >>>> >>>> >>> I'm not sure how that would work. >>> Where would you place that `__attribute__((implementation))` ? We do not >>> have the definitions for `__nv_*` as they come from NVIDIA-provided >>> bitcode. We could add the attribute to the declaration in >>> `__clang_cuda_libdevice_declares.h`. >>> How does LLVM handle the differences in function attributes between >>> function declaration and definition? Will there be trouble when we link >> in >>> the actual __nv_cos from the libdevice that would not have that >> attribute? >>> Another potential gotcha is that for the functions that can't be directly >>> mapped 1:1 to `__nv_*` counterparts, we'd still need to provide the >>> implementation ourselves. We will not know whether the implementation >> will >>> be used until after the substitution pass, so we'll need to make sure >> it's >>> not DCE'd until then. It appears to be the same issue (though on a >> smaller >>> scale) as with linking in libdevice directly. >>> >>> Let's take a step back and figure out what are the issues we want to >> solve. >>> The top-level goal is to provide implementation for LLVM intrinsics. For >>> now let's stick with libm-related ones. >>> What we have is the libdevice bitcode which uses different function names >>> and provides a subset of the functionality we need. >>> What we miss is >>> - something to connect LLVM's libcalls to the GPU-side implementation, >>> - additional code to provide implementations for the functions that >> are >>> missing or different in libdevice. >>> >>> Considering that we want this to work in LLVM, the additional code would >>> have to be a bitcode and it would have to exist in addition to libdevice. >>> Our options for the mapping between LLVM intrinsics and the >> implementation >>> are >>> * intrinsic -> __nv_* equivalent mapping pass >>> This would still need additional bitcode for the missing/different >>> functions. >>> * lower libcalls to the standard libm APIs, implement libm -> __nv_* >>> mapping in our own bitcode. >>> >>> Considering that additional bitcode is needed in both cases, I believe >> that >>> the second approach makes more sense. >> I really hope to avoid any additional bitcode, there are too many >> drawbacks and basically no benefits, IMHO. >> > Could you elaborate on the drawbacks? > > The fact is that we already depend on the external bitcode (libdevice in > this case), though right now we're trying to keep that to clang only. The > current approach is not sound in principle and is rather brittle in > practice. Nor clang is the only source of the IR for the LLVM to > compile, so it leaves LLVM-only users without a good solution. There are > already a handful of JIT compilers that each do their own gluing of > libdevice into the IR they want to compile for NVPTX. 
I think we do have a > very good reason to deal with that in LLVM itself. > > While I agree that additional bitcode is a hassle, I think it would be a > net positive change for LLVM usability for NVPTX users. > The external bitcode would not be required for those who do not need > libdevice now, so the change should not be disruptive.Bitcode comes with all the problems libdevice itself has wrt. compatibility. It is also hard to update and maintain. You basically maintain IR or you maintain C(++) as I suggest. Also, bitcode is platform specific. I can imagine building a bitcode file during the build but shipping one means you have to know ABI and datalayout or hope they are the same everywhere.>>> LLVM does not need to know or care about what's provided by libdevice, >> and >>> we'd have more flexibility, compared to what we could do in the mapping >>> pass. It also makes it easy to substitute a different implementation, if >> we >>> have or need one. >> I agree that LLVM (core) should not know about __nv_*, that's why I >> suggested >> the `__attribute__((implements("...")))` approach. My preferred solution >> is still to annotate our declarations of __nv_* and point to the >> llvm.intrinsics (name) from there. If we have a missing mapping, we >> point to an >> intrinsic from a definition that lives in the Clang headers next to the >> __nv_* declarations. >> > We may have slightly different end goals in mind. > I was thinking of making the solution work for LLVM. I.e. users would be > free to use llvm.sin with NVPTX back-end with a few documented steps needed > to make it work (basically "pass additional > -link-libm-bitcode=path/to/bitcode_libm.bc"). > > Your scenario above suggests that the goal is to allow clang to generate > both llvm intrinsics and the glue which would then be used by LLVM to make > it work for clang, but not in general. It's an improvement compared to what > we have now, but I still think we should try a more general solution. >My scenario doesn't disallow a bitcode approach for non-clang frontends, nor does it disallow them to simply build the glue code with clang and package it themselves. It does however allow us to maintain C(++) code rather than IR, which is by itself a big win.>> This does not yet work because -mlink-builtin-bitcode (which I assume >> triggers the llvm-link logic) will drop the attributes of a declaration >> if a definition is found. I think that should not be the case anyway >> such that the union of attributes is set. >> >> The benefit I see for the above is that the mapping is tied to the >> declarations and doesn't live in a tablegen file far away. It works well >> even if we can't map 1:1, and we could even restrict the "used" attribute >> to anything that has an "implements" attribute. > > I do not think we need tablegen for anything here. I was thinking of just > compiling a real math library (or a wrapper on top of libdevice) from C/C++ > sources.I did not understand your suggestion before. Agreed, no tablegen.> > Our approaches are not mutually exclusive. If there's a strong opposition > to providing a bitcode libm for NVPTX, implementing it somewhere closer to > clang would still be an improvement, even if it's not as general as I'd > like. It should still be possible to allow LLVM to lower libcalls in NVPTX > to standard libm API, enabled with a flag, and just let the end users who > are interested (e.g. JITs) to provide their own implementation.Right. 
And their own implementation could be trivially created for them as bc file:
`clang -emit-llvm-bc $clang_src/.../__clang_cuda_cmath.h -femit-all-decls`
Or am I missing something here?

~ Johannes
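To make the declaration-side proposal concrete, here is a hedged sketch of what annotated declarations (in the spirit of `__clang_cuda_libdevice_declares.h`) might look like. Note that "implements" is the attribute being proposed in this thread, not something clang supports today, and the overloaded intrinsic names are chosen purely for illustration:

```cpp
// Hypothetical sketch only: "implements" is the proposed attribute, not an
// existing clang attribute. The declarations mirror what
// __clang_cuda_libdevice_declares.h already provides; the definitions
// still come from NVIDIA's libdevice.bc.
extern "C" {
__attribute__((implements("llvm.cos.f64")))
double __nv_cos(double);

__attribute__((implements("llvm.sin.f64")))
double __nv_sin(double);

__attribute__((implements("llvm.fabs.f64")))
double __nv_fabs(double);
}
```

A substitution pass would then rewrite calls to `llvm.cos.f64` into calls to `__nv_cos`, and anything carrying the attribute would be kept alive (the "used" mechanism discussed above) until that pass has run.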
Johannes Doerfert via llvm-dev
2021-Mar-12 16:48 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
I prototyped the LLVM-Core parts last night: https://reviews.llvm.org/D98516 If this is something we support I'll write an RFC, also for the missing clang parts. ~ Johannes [EOM] On 3/11/21 7:59 PM, Artem Belevich wrote:> On Thu, Mar 11, 2021 at 4:10 PM Johannes Doerfert < > johannesdoerfert at gmail.com> wrote: > >> On 3/11/21 1:37 PM, Artem Belevich wrote: >>> On Thu, Mar 11, 2021 at 10:54 AM Johannes Doerfert < >>> johannesdoerfert at gmail.com> wrote: >>> >>>> I certainly agree we should try to avoid a hard-coded mapping >>>> in C++. >>>> >>>> I could see something like: >>>> >>>> ``` >>>> __attribute__((implementation("llvm.cos")) >>>> double __nv_cos(...) { ... } >>>> >>>> ``` >>>> >>>> and a pass that transforms all calls to a function with an >>>> "implementation" to calls to that implementation. Maybe >>>> later we attach a score/priority ;) >>>> >>>> >>> I'm not sure how that would work. >>> Where would you place that `__attribute__((implementation))` ? We do not >>> have the definitions for `__nv_*` as they come from NVIDIA-provided >>> bitcode. We could add the attribute to the declaration in >>> `__clang_cuda_libdevice_declares.h`. >>> How does LLVM handle the differences in function attributes between >>> function declaration and definition? Will there be trouble when we link >> in >>> the actual __nv_cos from the libdevice that would not have that >> attribute? >>> Another potential gotcha is that for the functions that can't be directly >>> mapped 1:1 to `__nv_*` counterparts, we'd still need to provide the >>> implementation ourselves. We will not know whether the implementation >> will >>> be used until after the substitution pass, so we'll need to make sure >> it's >>> not DCE'd until then. It appears to be the same issue (though on a >> smaller >>> scale) as with linking in libdevice directly. >>> >>> Let's take a step back and figure out what are the issues we want to >> solve. >>> The top-level goal is to provide implementation for LLVM intrinsics. For >>> now let's stick with libm-related ones. >>> What we have is the libdevice bitcode which uses different function names >>> and provides a subset of the functionality we need. >>> What we miss is >>> - something to connect LLVM's libcalls to the GPU-side implementation, >>> - additional code to provide implementations for the functions that >> are >>> missing or different in libdevice. >>> >>> Considering that we want this to work in LLVM, the additional code would >>> have to be a bitcode and it would have to exist in addition to libdevice. >>> Our options for the mapping between LLVM intrinsics and the >> implementation >>> are >>> * intrinsic -> __nv_* equivalent mapping pass >>> This would still need additional bitcode for the missing/different >>> functions. >>> * lower libcalls to the standard libm APIs, implement libm -> __nv_* >>> mapping in our own bitcode. >>> >>> Considering that additional bitcode is needed in both cases, I believe >> that >>> the second approach makes more sense. >> I really hope to avoid any additional bitcode, there are too many >> drawbacks and basically no benefits, IMHO. >> > Could you elaborate on the drawbacks? > > The fact is that we already depend on the external bitcode (libdevice in > this case), though right now we're trying to keep that to clang only. The > current approach is not sound in principle and is rather brittle in > practice. Nor clang is the only source of the IR for the LLVM to > compile, so it leaves LLVM-only users without a good solution. 
There are > already a handful of JIT compilers that each do their own gluing of > libdevice into the IR they want to compile for NVPTX. I think we do have a > very good reason to deal with that in LLVM itself. > > While I agree that additional bitcode is a hassle, I think it would be a > net positive change for LLVM usability for NVPTX users. > The external bitcode would not be required for those who do not need > libdevice now, so the change should not be disruptive. > >>> LLVM does not need to know or care about what's provided by libdevice, >> and >>> we'd have more flexibility, compared to what we could do in the mapping >>> pass. It also makes it easy to substitute a different implementation, if >> we >>> have or need one. >> I agree that LLVM (core) should not know about __nv_*, that's why I >> suggested >> the `__attribute__((implements("...")))` approach. My preferred solution >> is still to annotate our declarations of __nv_* and point to the >> llvm.intrinsics (name) from there. If we have a missing mapping, we >> point to an >> intrinsic from a definition that lives in the Clang headers next to the >> __nv_* declarations. >> > We may have slightly different end goals in mind. > I was thinking of making the solution work for LLVM. I.e. users would be > free to use llvm.sin with NVPTX back-end with a few documented steps needed > to make it work (basically "pass additional > -link-libm-bitcode=path/to/bitcode_libm.bc"). > > Your scenario above suggests that the goal is to allow clang to generate > both llvm intrinsics and the glue which would then be used by LLVM to make > it work for clang, but not in general. It's an improvement compared to what > we have now, but I still think we should try a more general solution. > > >> This does not yet work because -mlink-builtin-bitcode (which I assume >> triggers the llvm-link logic) will drop the attributes of a declaration >> if a definition is found. I think that should not be the case anyway >> such that the union of attributes is set. >> >> The benefit I see for the above is that the mapping is tied to the >> declarations and doesn't live in a tablegen file far away. It works well >> even if we can't map 1:1, and we could even restrict the "used" attribute >> to anything that has an "implements" attribute. > > I do not think we need tablegen for anything here. I was thinking of just > compiling a real math library (or a wrapper on top of libdevice) from C/C++ > sources. > > Our approaches are not mutually exclusive. If there's a strong opposition > to providing a bitcode libm for NVPTX, implementing it somewhere closer to > clang would still be an improvement, even if it's not as general as I'd > like. It should still be possible to allow LLVM to lower libcalls in NVPTX > to standard libm API, enabled with a flag, and just let the end users who > are interested (e.g. JITs) to provide their own implementation. > > --Artem > > > >> So: >> >> ``` >> __nv_A() { ... } // called, inlined and optimized as before, DCE'ed after. >> >> __nv_B() { ... } // not called, DCE'ed. >> >> __attribute__((implements("llvm.C")) >> __nv_C() { ... } // calls are inlined and optimized as before, not DCE'ed >> // though because of the attribute. Replaces llvm.C as >> // callee in the special pass. >> ``` >> >> So "implements" gives you a way to statically replace a function >> declaration >> or definition with another one. I could see it being used to provide other >> intrinsics to platforms with backends that don't support them. 
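For illustration, a minimal sketch of the rewrite step such an "implements" attribute implies (not the D98516 patch itself) could look like the following. It assumes the mapping is recorded as a string function attribute named "implements" and that the providing function's signature matches the intrinsic's; both are illustrative choices.

```
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Sketch: redirect uses of an intrinsic (e.g. llvm.cos.f64) to the function
// that declares it implements that intrinsic (e.g. __nv_cos).
static bool lowerImplementsAttr(Module &M) {
  bool Changed = false;
  for (Function &Provider : M) {
    if (Provider.isDeclaration() || !Provider.hasFnAttribute("implements"))
      continue;
    StringRef IntrinsicName =
        Provider.getFnAttribute("implements").getValueAsString();
    Function *Intrinsic = M.getFunction(IntrinsicName);
    // Only a direct replacement when the signatures line up.
    if (!Intrinsic || Intrinsic->getFunctionType() != Provider.getFunctionType())
      continue;
    Intrinsic->replaceAllUsesWith(&Provider);
    Changed = true;
  }
  return Changed;
}
```

With something of this shape, the only NVPTX-specific knowledge lives in the annotated declarations/definitions; the pass itself stays target-agnostic.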
>> >> Does that make some sense? >> >> ~ Johannes >> >> >>> WDYT? >>> >>> --Artem >>> >>> >>>> On 3/10/21 8:44 PM, William Moses wrote: >>>>> We could also consider doing something slightly broader. >>>>> >>>>> For example we could define a special attribute on top of the llvm.cos >>>>> call/declaration etc with metadata or an attribute that points to the >>>>> actual __nv_cos function. Then in a subsequent lowering pass the >>>>> corresponding intrinsic with the relevant attribute has its uses >> replaced >>>>> by the actual function. >>>>> >>>>> >>>>> On Wed, Mar 10, 2021 at 7:57 PM Johannes Doerfert < >>>>> johannesdoerfert at gmail.com> wrote: >>>>> >>>>>> On 3/10/21 6:22 PM, Artem Belevich wrote: >>>>>>> On Wed, Mar 10, 2021 at 3:44 PM Johannes Doerfert < >>>>>>> johannesdoerfert at gmail.com> wrote: >>>>>>> >>>>>>>> On 3/10/21 4:38 PM, Artem Belevich wrote: >>>>>>>>> On Wed, Mar 10, 2021 at 1:55 PM Johannes Doerfert < >>>>>>>>> johannesdoerfert at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> On 3/10/21 3:25 PM, Artem Belevich wrote: >>>>>>>>>>> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert < >>>>>>>>>>> johannesdoerfert at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Right. We could keep the definition of __nv_cos and friends >>>>>>>>>>>> around. Right now, -ffast-math might just crash on the user, >>>>>>>>>>>> which is arguably a bad thing. I can also see us benefiting >>>>>>>>>>>> in various other ways from llvm.cos uses instead of __nv_cos >>>>>>>>>>>> (assuming precision is according to the user requirements but >>>>>>>>>>>> that is always a condition). >>>>>>>>>>>> >>>>>>>>>>>> It could be as simple as introducing __nv_cos into >>>>>>>>>>>> "llvm.used" and a backend matching/rewrite pass. >>>>>>>>>>>> >>>>>>>>>>>> If the backend knew the libdevice location it could even pick >>>>>>>>>>>> the definitions from there. Maybe we could link libdevice late >>>>>>>>>>>> instead of eager? >>>>>>>>>>>> >>>>>>>>>>> It's possible, but it would require plumbing in CUDA SDK >> awareness >>>>>> into >>>>>>>>>>> LLVM. While clang driver can deal with that, LLVM currently >> can't. >>>>>> The >>>>>>>>>>> bitcode library path would have to be provided by the user. >>>>>>>>>> The PTX backend could arguably be CUDA SDK aware, IMHO, it would >>>>>>>>>> even be fine if the middle-end does the remapping to get inlining >>>>>>>>>> and folding benefits also after __nv_cos is used. See below. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> The standard library as bitcode raises some questions. >>>>>>>>>> Which standard library? CUDAs libdevice is a bitcode library, >> right? >>>>>>>>> It's whatever LLVM will need to lower libcalls to. libdevice >> bitcode >>>> is >>>>>>>> the >>>>>>>>> closest approximation of that we have at the moment. >>>>>>>>> >>>>>>>>> >>>>>>>>>>> * When do we want to do the linking? If we do it at the >> beginning, >>>>>> then >>>>>>>>>> the >>>>>>>>>>> question is how to make sure unused functions are not eliminated >>>>>> before >>>>>>>>>> we >>>>>>>>>>> may need them, as we don't know apriori what's going to be >> needed. >>>> We >>>>>>>>>> also >>>>>>>>>>> do want the unused functions to be gone after we're done. Linking >>>> it >>>>>> in >>>>>>>>>>> early would allow optimizing the code better at the expense of >>>> having >>>>>>>> to >>>>>>>>>>> optimize a lot of code we'll throw away. Linking it in late has >>>> less >>>>>>>>>>> overhead, but leaves the linked in bitcode unoptimized, though >> it's >>>>>>>>>>> probably in the ballpark of what would happen with a real library >>>>>> call. 
>>>>>>>>>>> I.e. no inlining, etc. >>>>>>>>>>> >>>>>>>>>>> * It incorporates linking into LLVM, which is not LLVM's job. >>>>>> Arguably, >>>>>>>>>> the >>>>>>>>>>> line should be drawn at the lowering to libcalls as it's done for >>>>>> other >>>>>>>>>>> back-ends. However, we're also constrained to by the need to have >>>> the >>>>>>>>>>> linking done before we generate PTX which prevents doing it after >>>>>> LLVM >>>>>>>> is >>>>>>>>>>> done generating an object file. >>>>>>>>>> I'm confused. Clang links in libdevice.bc early. >>>>>>>>> Yes. Because that's where it has to happen if we want to keep LLVM >>>>>>>> unaware >>>>>>>>> of CUDA SDK. >>>>>>>>> It does not have to be the case if/when LLVM can do the linking >>>> itself. >>>>>>>>>> If we make sure >>>>>>>>>> `__nv_cos` is not deleted early, we can at any point "lower" >>>>>> `llvm.cos` >>>>>>>>>> to `__nv_cos` which is available. After the lowering we can remove >>>>>>>>>> the artificial uses of `__nv_XXX` functions that we used to keep >> the >>>>>>>>>> definitions around in order to remove them from the final result. >>>>>>>>>> >>>>>>>>> This is the 'link early' approach, I should've been explicit that >>>> it's >>>>>>>>> 'link early *everything*' as opposed to linking only what's needed >> at >>>>>> the >>>>>>>>> beginning. >>>>>>>>> It would work at the expense of having to process/optimize 500KB >>>> worth >>>>>> of >>>>>>>>> bitcode for every compilation, whether it needs it or not. >>>>>>>>> >>>>>>>>> >>>>>>>>>> We get the benefit of having `llvm.cos` for some of the pipeline, >>>>>>>>>> we know it does not have all the bad effects while `__nv_cos` is >>>>>> defined >>>>>>>>>> with inline assembly. We also get the benefit of inlining >> `__nv_cos` >>>>>>>>>> and folding the implementation based on the arguments. Finally, >>>>>>>>>> this should work with the existing pipeline, the linking is the >> same >>>>>>>>>> as before, all we do is to keep the definitions alive longer and >>>>>>>>>> lower `llvm.cos` to `__nv_cos` in a middle end pass. >>>>>>>>>> >>>>>>>>> Again, I agree that it is doable. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> This might be similar to the PTX solution you describe below but I >>>>>> feel >>>>>>>>>> we get the inline benefit from this without actually changing the >>>>>>>> pipeline >>>>>>>>>> at all. >>>>>>>>>> >>>>>>>>> So, to summarize: >>>>>>>>> * link the library as bitcode early, add artificial placeholders >> for >>>>>>>>> everything, compile, remove placeholders and DCE unused stuff away. >>>>>>>>> Pros: >>>>>>>>> - we're already doing most of it before clang hands hands >> off >>>>>> IR to >>>>>>>>> LLVM, so it just pushes it a bit lower in the compilation. >>>>>>>>> Cons: >>>>>>>>> - runtime cost of optimizing libdevice bitcode, >>>>>>>>> - libdevice may be required for all NVPTX compilations? >>>>>>>>> >>>>>>>>> * link the library as bitcode late. >>>>>>>>> Pros: >>>>>>>>> - lower runtime cost than link-early approach. >>>>>>>>> Cons: >>>>>>>>> - We'll need to make sure that NVVMReflect pass processes >> the >>>>>>>> library. >>>>>>>>> - less optimizations on the library functions. Some of the >>>> code >>>>>>>> gets >>>>>>>>> DCE'ed away after NVVMReflect and the rest could be optimized >> better. >>>>>>>>> - libdevice may be required for all NVPTX compilations? >>>>>>>>> * 'link' with the library as PTX appended as text to LLVM's output >>>> and >>>>>>>> let >>>>>>>>> ptxas do the 'linking' >>>>>>>>> Pros: LLVM remains agnostic of CUDA SDK installation >> details. 
>>>> All >>>>>> it >>>>>>>>> does is allows lowering libcalls and leaves their resolution to the >>>>>>>>> external tools. >>>>>>>>> Cons: Need to have the PTX library somewhere and need to >>>> integrate >>>>>> the >>>>>>>>> 'linking' into the compilation process somehow. >>>>>>>>> >>>>>>>>> Neither is particularly good. If the runtime overhead of link-early >>>> is >>>>>>>>> acceptable, then it may be a winner here, by a very small margin. >>>>>>>>> link-as-PTX may be better conceptually as it keeps linking and >>>>>>>> compilation >>>>>>>>> separate. >>>>>>>>> >>>>>>>>> As for the practical steps, here's what we need: >>>>>>>>> - allow libcall lowering in NVPTX, possibly guarded by a flag. This >>>> is >>>>>>>>> needed for all of the approaches above. >>>>>>>>> - teach LLVM how to link in bitcode (and, possibly, control >>>> early/late >>>>>>>> mode) >>>>>>>>> - teach clang driver to delegate libdevice linking to LLVM. >>>>>>>>> >>>>>>>>> This will allow us to experiment with all three approaches and see >>>> what >>>>>>>>> works best. >>>>>>>> I think if we embed knowledge about the nv_XXX functions we can >>>>>>>> even get away without the cons you listed for early linking above. >>>>>>>> >>>>>>> WDYM by `embed knowledge about the nv_XXX functions`? By linking >> those >>>>>>> functions in? Of do you mean that we should just declare them >>>>>>> before/instead of linking libdevice in? >>>>>> I mean by providing the "libcall lowering" pass. So the knowledge >>>>>> that llvm.cos maps to __nv_cos. >>>>>> >>>>>>>> For early link I'm assuming an order similar to [0] but I also >> discuss >>>>>>>> the case where we don't link libdevice early for a TU. >>>>>>>> >>>>>>> That link just describes the steps needed to use libdevice. It does >> not >>>>>>> deal with how/where it fits in the LLVM pipeline. >>>>>>> The gist is that NVVMreflect replaces some conditionals with >> constants. >>>>>>> libdevice uses that as a poor man's IR preprocessor, conditionally >>>>>> enabling >>>>>>> different implementations and relying on DCE and constant folding to >>>>>> remove >>>>>>> unused parts and eliminate the now useless branches. >>>>>>> While running NVVM alone will make libdevice code valid and usable, >> it >>>>>>> would still benefit from further optimizations. I do not know to what >>>>>>> degree, though. >>>>>>> >>>>>>> >>>>>>>> Link early: >>>>>>>> 1) clang emits module.bc and links in libdevice.bc but with the >>>>>>>> `optnone`, `noinline`, and "used" attribute for functions in >>>>>>>> libdevice. ("used" is not an attribute but could as well be.) >>>>>>>> At this stage module.bc might call __nv_XXX or llvm.XXX >> freely >>>>>>>> as defined by -ffast-math and friends. >>>>>>>> >>>>>>> That could work. Just carrying extra IR around would probably be OK. >>>>>>> We may want to do NVVMReflect as soon as we have it linked in and, >>>> maybe, >>>>>>> allow optimizing the functions that are explicitly used already. >>>>>> Right. NVVMReflect can be run twice and with `alwaysinline` >>>>>> on the call sites of __nv_XXX functions we will actually >>>>>> inline and optimize them while the definitions are just "dragged >>>>>> along" in case we need them later. >>>>>> >>>>>> >>>>>>>> 2) Run some optimizations in the middle end, maybe till the end of >>>>>>>> the inliner loop, unsure. >>>>>>>> 3) Run a libcall lowering pass and another NVVMReflect pass (or the >>>>>>>> only instance thereof). We effectively remove all llvm.XXX >> calls >>>>>>> in favor of __nv_XXX now. 
Note that we haven't spend (much) >> time >>>>>>>> on the libdevice code as it is optnone and most passes are >> good >>>>>>>> at skipping those. To me, it's unclear if the used parts >> should >>>>>>>> not be optimized before we inline them anyway to avoid >> redoing >>>>>>>> the optimizations over and over (per call site). That needs >>>>>>>> measuring I guess. Also note that we can still retain the >>>> current >>>>>>>> behavior for direct calls to __nv_XXX if we mark the call >> sites >>>>>>>> as `alwaysinline`, or at least the behavior is almost like >> the >>>>>>>> current one is. >>>>>>>> 4) Run an always inliner pass on the __nv_XXX calls because it is >>>>>>>> something we would do right now. Alternatively, remove >> `optnone` >>>>>>>> and `noinline` from the __nv_XXX calls. >>>>>>>> 5) Continue with the pipeline as before. >>>>>>>> >>>>>>>> >>>>>>> SGTM. >>>>>>> >>>>>>> >>>>>>>> As mentioned above, `optnone` avoids spending time on the libdevice >>>>>>>> until we "activate" it. At that point (globals) DCE can be scheduled >>>>>>>> to remove all unused parts right away. I don't think this is (much) >>>>>>>> more expensive than linking libdevice early right now. >>>>>>>> >>>>>>>> Link late, aka. translation units without libdevice: >>>>>>>> 1) clang emits module.bc but does not link in libdevice.bc, it will >> be >>>>>>>> made available later. We still can mix __nv_XXX and llvm.XXX >>>> calls >>>>>>>> freely as above. >>>>>>>> 2) Same as above. >>>>>>>> 3) Same as above. >>>>>>>> 4) Same as above but effectively a no-op, no __nv_XXX definitions >> are >>>>>>>> available. >>>>>>>> 5) Same as above. >>>>>>>> >>>>>>>> >>>>>>>> I might misunderstand something about the current pipeline but from >>>> [0] >>>>>>>> and the experiments I run locally it looks like the above should >> cover >>>>>> all >>>>>>>> the cases. WDYT? >>>>>>>> >>>>>>>> >>>>>>> The `optnone` trick may indeed remove much of the practical >> differences >>>>>>> between the early/late approaches. >>>>>>> In principle it should work. >>>>>>> >>>>>>> Next question is -- is libdevice sufficient to satisfy LLVM's >>>> assumptions >>>>>>> about the standard library. >>>>>>> While it does provide most of the equivalents of libm functions, the >>>> set >>>>>> is >>>>>>> not complete and some of the functions differ from their libm >>>>>> counterparts. >>>>>>> The differences are minor, so we should be able to deal with it by >>>>>>> generating few wrapper functions for the odd cases. >>>>>>> Here's what clang does to provide math functions using libdevice: >>>>>>> >> https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h >>>>>> Right now, clang will generate any llvm intrinsic and we crash, so >>>> anything >>>>>> else is probably a step in the right direction. Eventually, we should >>>>>> "lower" >>>>>> all intrinsics that the NVPTX backend can't handle or at least emit a >>>> nice >>>>>> error message. Preferably, clang would know what we can't deal with >> and >>>> not >>>>>> generate intinsic calls for those in the first place. >>>>>> >>>>>> >>>>>>> The most concerning aspect of libdevice is that we don't know when >>>> we'll >>>>>> no >>>>>>> longer be able to use the libdevice bitcode? My understanding is that >>>> IR >>>>>>> does not guarantee binary stability and at some point we may just be >>>>>> unable >>>>>>> to use it. Ideally we need our own libm for GPUs. 
>>>>>> For OpenMP I did my best to avoid writing libm (code) for GPUs by >> piggy >>>>>> backing on CUDA and libc++ implementations, I hope it will stay that >>>> way. >>>>>> That said, if the need arises we might really have to port libc++ to >> the >>>>>> GPUs. >>>>>> >>>>>> Back to the problem with libdevice. I agree that the solution of >> NVIDIA >>>>>> to ship a .bc library is suboptimal but with the existing, or an >>>> extended, >>>>>> auto-upgrader we might be able to make that work reasonably well for >> the >>>>>> foreseeable future. That problem is orthogonal to what we are >> discussing >>>>>> above, I think. >>>>>> >>>>>> ~ Johannes >>>>>> >>>>>> >>>>>>> --Artem >>>>>>> >>>>>>> >>>>>>>> ~ Johannes >>>>>>>> >>>>>>>> >>>>>>>> P.S. If the rewrite capability (aka libcall lowering) is generic we >>>>>> could >>>>>>>> use the scheme for many other things as well. >>>>>>>> >>>>>>>> >>>>>>>> [0] https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice >>>>>>>> >>>>>>>> >>>>>>>>> --Artem >>>>>>>>> >>>>>>>>> >>>>>>>>>> ~ Johannes >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> One thing that may work within the existing compilation model is >> to >>>>>>>>>>> pre-compile the standard library into PTX and then textually >> embed >>>>>>>>>> relevant >>>>>>>>>>> functions into the generated PTX, thus pushing the 'linking' >> phase >>>>>> past >>>>>>>>>> the >>>>>>>>>>> end of LLVM's compilation and make it look closer to the standard >>>>>>>>>>> compile/link process. This way we'd only enable libcall lowering >> in >>>>>>>>>> NVPTX, >>>>>>>>>>> assuming that the library functions will be magically available >> out >>>>>>>>>> there. >>>>>>>>>>> Injection of PTX could be done with an external script outside of >>>>>> LLVM >>>>>>>>>> and >>>>>>>>>>> it could be incorporated into clang driver. Bonus points for the >>>> fact >>>>>>>>>> that >>>>>>>>>>> this scheme is compatible with -fgpu-rdc out of the box -- >> assemble >>>>>> the >>>>>>>>>> PTX >>>>>>>>>>> with `ptxas -rdc` and then actually link with the library, >> instead >>>> of >>>>>>>>>>> injecting its PTX before invoking ptxas. >>>>>>>>>>> >>>>>>>>>>> --Artem >>>>>>>>>>> >>>>>>>>>>> Trying to figure out a good way to have the cake and eat it too. >>>>>>>>>>>> ~ Johannes >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On 3/10/21 2:49 PM, William Moses wrote: >>>>>>>>>>>>> Since clang (and arguably any other frontend that uses) should >>>> link >>>>>>>> in >>>>>>>>>>>>> libdevice, could we lower these intrinsics to the libdevice >> code? >>>>>>>>>>> The linking happens *before* LLVM gets to work on IR. >>>>>>>>>>> As I said, it's a workaround, not the solution. It's possible for >>>>>> LLVM >>>>>>>> to >>>>>>>>>>> still attempt lowering something in the IR into a libcall and we >>>>>> would >>>>>>>>>> not >>>>>>>>>>> be able to deal with that. It happens to work well enough in >>>>>> practice. >>>>>>>>>>> Do you have an example where you see the problem with >> -ffast-math? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>> For example, consider compiling the simple device function >> below: >>>>>>>>>>>>> ``` >>>>>>>>>>>>> // /mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S >>>> -emit-llvm >>>>>>>>>>>>> --cuda-path=/usr/local/cuda-11.0 >>>> -L/usr/local/cuda-11.0/lib64 >>>>>>>>>>>>> --cuda-gpu-arch=sm_37 >>>>>>>>>>>>> __device__ double f(double x) { >>>>>>>>>>>>> return cos(x); >>>>>>>>>>>>> } >>>>>>>>>>>>> ``` >>>>>>>>>>>>> >>>>>>>>>>>>> The LLVM module for it is as follows: >>>>>>>>>>>>> >>>>>>>>>>>>> ``` >>>>>>>>>>>>> ... 
>>>>>>>>>>>>> define dso_local double @_Z1fd(double %x) #0 { >>>>>>>>>>>>> entry: >>>>>>>>>>>>> %__a.addr.i = alloca double, align 8 >>>>>>>>>>>>> %x.addr = alloca double, align 8 >>>>>>>>>>>>> store double %x, double* %x.addr, align 8 >>>>>>>>>>>>> %0 = load double, double* %x.addr, align 8 >>>>>>>>>>>>> store double %0, double* %__a.addr.i, align 8 >>>>>>>>>>>>> %1 = load double, double* %__a.addr.i, align 8 >>>>>>>>>>>>> %call.i = call contract double @__nv_cos(double %1) #7 >>>>>>>>>>>>> ret double %call.i >>>>>>>>>>>>> } >>>>>>>>>>>>> >>>>>>>>>>>>> define internal double @__nv_cos(double %a) #1 { >>>>>>>>>>>>> %q.i = alloca i32, align 4 >>>>>>>>>>>>> ``` >>>>>>>>>>>>> >>>>>>>>>>>>> Obviously we would need to do something to ensure these >> functions >>>>>>>> don't >>>>>>>>>>>> get >>>>>>>>>>>>> deleted prior to their use in lowering from intrinsic to >>>> libdevice. >>>>>>>>>>>>> ... >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <tra at google.com >>>>>>>> wrote: >>>>>>>>>>>>>> On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert < >>>>>>>>>>>>>> johannesdoerfert at gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Artem, Justin, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I am running into a problem and I'm curious if I'm missing >>>>>>>> something >>>>>>>>>> or >>>>>>>>>>>>>>> if the support is simply missing. >>>>>>>>>>>>>>> Am I correct to assume the NVPTX backend does not deal with >>>>>>>>>> `llvm.sin` >>>>>>>>>>>>>>> and friends? >>>>>>>>>>>>>>> >>>>>>>>>>>>>> Correct. It can't deal with anything that may need to lower >> to a >>>>>>>>>>>> standard >>>>>>>>>>>>>> library call. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is what I see, with some variations: >>>>>>>>>> https://godbolt.org/z/PxsEWs >>>>>>>>>>>>>>> If this is missing in the backend, is there a plan to get >> this >>>>>>>>>> working, >>>>>>>>>>>>>>> I'd really like to have the >>>>>>>>>>>>>>> intrinsics in the middle end rather than __nv_cos, not to >>>> mention >>>>>>>>>> that >>>>>>>>>>>>>>> -ffast-math does emit intrinsics >>>>>>>>>>>>>>> and crashes. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> It all boils down to the fact that PTX does not have the >>>> standard >>>>>>>>>>>>>> libc/libm which LLVM could lower the calls to, nor does it >> have >>>> a >>>>>>>>>>>> 'linking' >>>>>>>>>>>>>> phase where we could link such a library in, if we had it. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Libdevice bitcode does provide the implementations for some of >>>> the >>>>>>>>>>>>>> functions (though with a __nv_ prefix) and clang links it in >> in >>>>>>>> order >>>>>>>>>> to >>>>>>>>>>>>>> avoid generating IR that LLVM can't handle, but that's a >>>>>> workaround >>>>>>>>>> that >>>>>>>>>>>>>> does not help LLVM itself. >>>>>>>>>>>>>> >>>>>>>>>>>>>> --Artem >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> ~ Johannes >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> ─────────────────── >>>>>>>>>>>>>>> ∽ Johannes (he/his) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> --Artem Belevich >>>>>>>>>>>>>> >
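As a closing illustration of the failure mode that started the thread, here is a hedged reproducer sketch (not the exact godbolt case): once fast-math-style flags let clang emit the intrinsic instead of the __nv_* call, the NVPTX backend has nothing to lower it to and fails at instruction selection.

```
// Illustrative flags, mirroring the cos() example above:
//   clang++ repro.cu -S -emit-llvm -ffast-math --cuda-gpu-arch=sm_70 \
//     --cuda-path=/usr/local/cuda-11.0
__device__ double g(double x) {
  // With fast-math this may be emitted as a call to @llvm.sin.f64, which
  // the NVPTX backend currently cannot select or lower to a libcall.
  return __builtin_sin(x);
}
```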