Artem Belevich via llvm-dev
2021-Mar-12 18:50 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
On Thu, Mar 11, 2021 at 8:26 PM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:

> On 3/11/21 7:59 PM, Artem Belevich wrote:
> > On Thu, Mar 11, 2021 at 4:10 PM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
> >> On 3/11/21 1:37 PM, Artem Belevich wrote:
> >>> On Thu, Mar 11, 2021 at 10:54 AM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
> >>>>
> >>>> I certainly agree we should try to avoid a hard-coded mapping in C++.
> >>>>
> >>>> I could see something like:
> >>>>
> >>>> ```
> >>>> __attribute__((implementation("llvm.cos")))
> >>>> double __nv_cos(...) { ... }
> >>>> ```
> >>>>
> >>>> and a pass that transforms all calls to a function with an "implementation" to calls to that implementation. Maybe later we attach a score/priority ;)
> >>>>
> >>> I'm not sure how that would work.
> >>> Where would you place that `__attribute__((implementation))`? We do not have the definitions for `__nv_*` as they come from NVIDIA-provided bitcode. We could add the attribute to the declaration in `__clang_cuda_libdevice_declares.h`.
> >>> How does LLVM handle the differences in function attributes between function declaration and definition? Will there be trouble when we link in the actual __nv_cos from the libdevice that would not have that attribute?
> >>> Another potential gotcha is that for the functions that can't be directly mapped 1:1 to `__nv_*` counterparts, we'd still need to provide the implementation ourselves. We will not know whether the implementation will be used until after the substitution pass, so we'll need to make sure it's not DCE'd until then. It appears to be the same issue (though on a smaller scale) as with linking in libdevice directly.
> >>>
> >>> Let's take a step back and figure out what are the issues we want to solve.
> >>> The top-level goal is to provide implementations for LLVM intrinsics. For now let's stick with libm-related ones.
> >>> What we have is the libdevice bitcode, which uses different function names and provides a subset of the functionality we need.
> >>> What we miss is
> >>> - something to connect LLVM's libcalls to the GPU-side implementation,
> >>> - additional code to provide implementations for the functions that are missing or different in libdevice.
> >>>
> >>> Considering that we want this to work in LLVM, the additional code would have to be bitcode and it would have to exist in addition to libdevice.
> >>> Our options for the mapping between LLVM intrinsics and the implementation are
> >>> * an intrinsic -> __nv_* equivalent mapping pass. This would still need additional bitcode for the missing/different functions.
> >>> * lower libcalls to the standard libm APIs, and implement the libm -> __nv_* mapping in our own bitcode.
> >>>
> >>> Considering that additional bitcode is needed in both cases, I believe that the second approach makes more sense.
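For illustration, a minimal sketch of what the second option's wrapper bitcode could look like. The file name, the function selection, and the build invocation below are assumptions, not an existing library; the `__nv_*` bodies would still come from libdevice when it is linked in.

```
/* nvptx_libm_wrappers.c (hypothetical) -- maps standard libm names to libdevice.
   Assumed build step, a CUDA-free cross-compile to bitcode:
     clang --target=nvptx64-nvidia-cuda -O2 -fno-builtin -emit-llvm -c \
           nvptx_libm_wrappers.c -o nvptx_libm_wrappers.bc */
double __nv_cos(double);
double __nv_sin(double);

double cos(double x) { return __nv_cos(x); }
double sin(double x) { return __nv_sin(x); }

/* Functions with no 1:1 libdevice counterpart would get a real
   implementation here instead of a one-line forwarder. */
```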
> >> I really hope to avoid any additional bitcode, there are too many drawbacks and basically no benefits, IMHO.
> >
> > Could you elaborate on the drawbacks?
> >
> > The fact is that we already depend on the external bitcode (libdevice in this case), though right now we're trying to keep that to clang only. The current approach is not sound in principle and is rather brittle in practice. Nor is clang the only source of IR for LLVM to compile, so it leaves LLVM-only users without a good solution. There are already a handful of JIT compilers that each do their own gluing of libdevice into the IR they want to compile for NVPTX. I think we do have a very good reason to deal with that in LLVM itself.
> >
> > While I agree that additional bitcode is a hassle, I think it would be a net positive change for LLVM usability for NVPTX users. The external bitcode would not be required for those who do not need libdevice now, so the change should not be disruptive.
>
> Bitcode comes with all the problems libdevice itself has wrt. compatibility.

We already have this problem, so it does not make things (much) worse than they are. Considering that we'll be able to keep the library in sync with LLVM, compatibility is less of a problem, as the library that would come with LLVM would be built with/for exactly that LLVM version.

> It is also hard to update and maintain. You basically maintain IR or you maintain C(++) as I suggest.

We seem to agree that the implementation of such a library would be in C/C++.

> Also, bitcode is platform specific. I can imagine building a bitcode file during the build, but shipping one means you have to know the ABI and datalayout, or hope they are the same everywhere.

Agreed. We will likely need multiple variants. We will compile specifically for NVPTX or AMDGPU, and we will know the specific ABI and data layout for them regardless of the host we're building on.

It appears to me that the difference vs. what we have now is that we'd need to have the libm sources somewhere, a process to build them for particular GPUs (which may need to be done out of the tree, as it may need the CUDA/HIP SDKs), and a way to incorporate such libraries into the LLVM distribution.

OK. I'll agree that that may be a bit too much for now.

> >>> LLVM does not need to know or care about what's provided by libdevice, and we'd have more flexibility compared to what we could do in the mapping pass. It also makes it easy to substitute a different implementation, if we have or need one.
> >>
> >> I agree that LLVM (core) should not know about __nv_*, that's why I suggested the `__attribute__((implements("...")))` approach. My preferred solution is still to annotate our declarations of __nv_* and point to the llvm.intrinsics (name) from there. If we have a missing mapping, we point to an intrinsic from a definition that lives in the Clang headers next to the __nv_* declarations.
> >
> > We may have slightly different end goals in mind.
> > I was thinking of making the solution work for LLVM. I.e. users would be free to use llvm.sin with the NVPTX back-end, with a few documented steps needed to make it work (basically "pass additional -link-libm-bitcode=path/to/bitcode_libm.bc").
> >
> > Your scenario above suggests that the goal is to allow clang to generate both llvm intrinsics and the glue which would then be used by LLVM to make it work for clang, but not in general. It's an improvement compared to what we have now, but I still think we should try a more general solution.
>
> My scenario doesn't disallow a bitcode approach for non-clang frontends, nor does it disallow them to simply build the glue code with clang and package it themselves. It does however allow us to maintain C(++) code rather than IR, which is by itself a big win.
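A minimal sketch of the annotated-declaration idea described above, as it might look in `__clang_cuda_libdevice_declares.h`. The `implements` attribute does not exist today, and the overloaded intrinsic names used here are just examples:

```
extern "C" {
__device__ __attribute__((implements("llvm.cos.f64"))) double __nv_cos(double);
__device__ __attribute__((implements("llvm.sin.f64"))) double __nv_sin(double);

// An intrinsic with no 1:1 libdevice counterpart could get a small definition
// right here, next to the declarations, carrying the same attribute.
}
```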
Agreed.

> >> This does not yet work because -mlink-builtin-bitcode (which I assume triggers the llvm-link logic) will drop the attributes of a declaration if a definition is found. I think that should not be the case anyway, such that the union of the attributes is set.
> >>
> >> The benefit I see for the above is that the mapping is tied to the declarations and doesn't live in a tablegen file far away. It works well even if we can't map 1:1, and we could even restrict the "used" attribute to anything that has an "implements" attribute.
> >
> > I do not think we need tablegen for anything here. I was thinking of just compiling a real math library (or a wrapper on top of libdevice) from C/C++ sources.
>
> I did not understand your suggestion before. Agreed, no tablegen.
>
> > Our approaches are not mutually exclusive. If there's strong opposition to providing a bitcode libm for NVPTX, implementing it somewhere closer to clang would still be an improvement, even if it's not as general as I'd like. It should still be possible to allow LLVM to lower libcalls in NVPTX to the standard libm API, enabled with a flag, and just let the end users who are interested (e.g. JITs) provide their own implementation.
>
> Right. And their own implementation could be trivially created for them as a bc file:
>
> `clang -emit-llvm-bc $clang_src/.../__clang_cuda_cmath.h -femit-all-decls`
>
> Or am I missing something here?

I think we're on the same page. Let's see where the attribute(implementation) gets us.

--Artem

> ~ Johannes
>
> > --Artem
> >
> >> So:
> >>
> >> ```
> >> __nv_A() { ... } // called, inlined and optimized as before, DCE'ed after.
> >>
> >> __nv_B() { ... } // not called, DCE'ed.
> >>
> >> __attribute__((implements("llvm.C")))
> >> __nv_C() { ... } // calls are inlined and optimized as before, not DCE'ed
> >>                  // though because of the attribute. Replaces llvm.C as
> >>                  // callee in the special pass.
> >> ```
> >>
> >> So "implements" gives you a way to statically replace a function declaration or definition with another one. I could see it being used to provide other intrinsics to platforms with backends that don't support them.
> >>
> >> Does that make some sense?
> >>
> >> ~ Johannes
> >>
> >>> WDYT?
> >>>
> >>> --Artem
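For illustration, a rough sketch of what the "special pass" above could look like: it redirects uses of an intrinsic to the function that claims to implement it. The "implements" string attribute and the pass itself are hypothetical, and the `llvm.used` bookkeeping and error handling are omitted.

```
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

// Hypothetical sketch: replace uses of llvm.* intrinsics with the function
// carrying a matching "implements" attribute (e.g. __nv_cos for llvm.cos.f64).
struct ImplementsLoweringPass : PassInfoMixin<ImplementsLoweringPass> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    bool Changed = false;
    for (Function &Impl : M) {
      if (!Impl.hasFnAttribute("implements"))
        continue;
      StringRef IntrName = Impl.getFnAttribute("implements").getValueAsString();
      Function *Intr = M.getFunction(IntrName);
      // Only rewrite when the signatures match; otherwise a wrapper is needed.
      if (Intr && Intr->getFunctionType() == Impl.getFunctionType()) {
        Intr->replaceAllUsesWith(&Impl); // retarget all call sites
        Changed = true;
      }
    }
    return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
  }
};
```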
> >>>> On 3/10/21 8:44 PM, William Moses wrote:
> >>>>> We could also consider doing something slightly broader.
> >>>>>
> >>>>> For example we could define a special attribute on top of the llvm.cos call/declaration etc. with metadata or an attribute that points to the actual __nv_cos function. Then in a subsequent lowering pass the corresponding intrinsic with the relevant attribute has its uses replaced by the actual function.
> >>>>>
> >>>>> On Wed, Mar 10, 2021 at 7:57 PM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
> >>>>>> On 3/10/21 6:22 PM, Artem Belevich wrote:
> >>>>>>> On Wed, Mar 10, 2021 at 3:44 PM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
> >>>>>>>> On 3/10/21 4:38 PM, Artem Belevich wrote:
> >>>>>>>>> On Wed, Mar 10, 2021 at 1:55 PM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
> >>>>>>>>>> On 3/10/21 3:25 PM, Artem Belevich wrote:
> >>>>>>>>>>> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Right. We could keep the definition of __nv_cos and friends around. Right now, -ffast-math might just crash on the user, which is arguably a bad thing. I can also see us benefiting in various other ways from llvm.cos uses instead of __nv_cos (assuming precision is according to the user requirements, but that is always a condition).
> >>>>>>>>>>>>
> >>>>>>>>>>>> It could be as simple as introducing __nv_cos into "llvm.used" and a backend matching/rewrite pass.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If the backend knew the libdevice location it could even pick the definitions from there. Maybe we could link libdevice late instead of eager?
> >>>>>>>>>>>
> >>>>>>>>>>> It's possible, but it would require plumbing CUDA SDK awareness into LLVM. While the clang driver can deal with that, LLVM currently can't. The bitcode library path would have to be provided by the user.
> >>>>>>>>>>
> >>>>>>>>>> The PTX backend could arguably be CUDA SDK aware, IMHO. It would even be fine if the middle-end does the remapping to get inlining and folding benefits also after __nv_cos is used. See below.
> >>>>>>>>>>
> >>>>>>>>>>> The standard library as bitcode raises some questions.
> >>>>>>>>>>
> >>>>>>>>>> Which standard library? CUDA's libdevice is a bitcode library, right?
> >>>>>>>>>
> >>>>>>>>> It's whatever LLVM will need to lower libcalls to. libdevice bitcode is the closest approximation of that we have at the moment.
> >>>>>>>>>
> >>>>>>>>>>> * When do we want to do the linking? If we do it at the beginning, then the question is how to make sure unused functions are not eliminated before we may need them, as we don't know a priori what's going to be needed. We also do want the unused functions to be gone after we're done. Linking it in early would allow optimizing the code better at the expense of having to optimize a lot of code we'll throw away. Linking it in late has less overhead, but leaves the linked-in bitcode unoptimized, though it's probably in the ballpark of what would happen with a real library call. I.e. no inlining, etc.
> >>>>>>>>>>>
> >>>>>>>>>>> * It incorporates linking into LLVM, which is not LLVM's job. Arguably, the line should be drawn at the lowering to libcalls, as it's done for other back-ends. However, we're also constrained by the need to have the linking done before we generate PTX, which prevents doing it after LLVM is done generating an object file.
> >>>>>>>>>>
> >>>>>>>>>> I'm confused. Clang links in libdevice.bc early.
> >>>>>>>>>
> >>>>>>>>> Yes. Because that's where it has to happen if we want to keep LLVM unaware of the CUDA SDK.
> >>>>>>>>> It does not have to be the case if/when LLVM can do the linking itself.
> >>>>>>>>>
> >>>>>>>>>> If we make sure `__nv_cos` is not deleted early, we can at any point "lower" `llvm.cos` to `__nv_cos`, which is available. After the lowering we can remove the artificial uses of `__nv_XXX` functions that we used to keep the definitions around, in order to remove them from the final result.
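A small sketch of how those artificial uses could be kept around. `appendToUsed` is an existing helper that records globals in `@llvm.used`; the helper function below and the place it would be called from are assumptions.

```
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/ModuleUtils.h"

// Hypothetical helper: pin every __nv_* definition so it survives until the
// llvm.cos -> __nv_cos rewrite has run; the entries are dropped afterwards.
static void keepLibdeviceDefinitionsAlive(llvm::Module &M) {
  llvm::SmallVector<llvm::GlobalValue *, 16> Keep;
  for (llvm::Function &F : M)
    if (!F.isDeclaration() && F.getName().startswith("__nv_"))
      Keep.push_back(&F);
  llvm::appendToUsed(M, Keep); // records them in @llvm.used
}
```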
> >>>>>>>>>
> >>>>>>>>> This is the 'link early' approach. I should've been explicit that it's 'link early *everything*' as opposed to linking only what's needed at the beginning.
> >>>>>>>>> It would work at the expense of having to process/optimize 500KB worth of bitcode for every compilation, whether it needs it or not.
> >>>>>>>>>
> >>>>>>>>>> We get the benefit of having `llvm.cos` for some of the pipeline, we know it does not have all the bad effects while `__nv_cos` is defined with inline assembly. We also get the benefit of inlining `__nv_cos` and folding the implementation based on the arguments. Finally, this should work with the existing pipeline, the linking is the same as before, all we do is to keep the definitions alive longer and lower `llvm.cos` to `__nv_cos` in a middle end pass.
> >>>>>>>>>
> >>>>>>>>> Again, I agree that it is doable.
> >>>>>>>>>
> >>>>>>>>>> This might be similar to the PTX solution you describe below, but I feel we get the inline benefit from this without actually changing the pipeline at all.
> >>>>>>>>>
> >>>>>>>>> So, to summarize:
> >>>>>>>>> * link the library as bitcode early, add artificial placeholders for everything, compile, remove placeholders and DCE unused stuff away.
> >>>>>>>>>   Pros:
> >>>>>>>>>   - we're already doing most of it before clang hands off IR to LLVM, so it just pushes it a bit lower in the compilation.
> >>>>>>>>>   Cons:
> >>>>>>>>>   - runtime cost of optimizing libdevice bitcode,
> >>>>>>>>>   - libdevice may be required for all NVPTX compilations?
> >>>>>>>>>
> >>>>>>>>> * link the library as bitcode late.
> >>>>>>>>>   Pros:
> >>>>>>>>>   - lower runtime cost than the link-early approach.
> >>>>>>>>>   Cons:
> >>>>>>>>>   - We'll need to make sure that the NVVMReflect pass processes the library.
> >>>>>>>>>   - less optimization of the library functions. Some of the code gets DCE'ed away after NVVMReflect and the rest could be optimized better.
> >>>>>>>>>   - libdevice may be required for all NVPTX compilations?
> >>>>>>>>>
> >>>>>>>>> * 'link' with the library as PTX appended as text to LLVM's output and let ptxas do the 'linking'.
> >>>>>>>>>   Pros: LLVM remains agnostic of CUDA SDK installation details. All it does is allow lowering libcalls and leaves their resolution to the external tools.
> >>>>>>>>>   Cons: Need to have the PTX library somewhere and need to integrate the 'linking' into the compilation process somehow.
> >>>>>>>>>
> >>>>>>>>> Neither is particularly good. If the runtime overhead of link-early is acceptable, then it may be a winner here, by a very small margin. link-as-PTX may be better conceptually as it keeps linking and compilation separate.
> >>>>>>>>>
> >>>>>>>>> As for the practical steps, here's what we need:
> >>>>>>>>> - allow libcall lowering in NVPTX, possibly guarded by a flag. This is needed for all of the approaches above.
> >>>>>>>>> - teach LLVM how to link in bitcode (and, possibly, control early/late mode)
> >>>>>>>>> - teach the clang driver to delegate libdevice linking to LLVM.
> >>>>>>>>>
> >>>>>>>>> This will allow us to experiment with all three approaches and see what works best.
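A sketch of how the first practical step might look in the NVPTX backend. The flag name is made up, the snippet assumes the surroundings of NVPTXISelLowering.cpp, and whether plain `Expand` produces the desired libcall for every type would need checking.

```
// Hypothetical option and lowering setup, sketched as if added to
// NVPTXISelLowering.cpp (which already has `using namespace llvm;`).
static cl::opt<bool> NVPTXLowerLibcalls(
    "nvptx-lower-libcalls", cl::Hidden, cl::init(false),
    cl::desc("Lower unsupported math intrinsics to libm calls"));

// In the NVPTXTargetLowering constructor:
if (NVPTXLowerLibcalls) {
  for (MVT VT : {MVT::f32, MVT::f64}) {
    setOperationAction(ISD::FSIN, VT, Expand); // intended to end up as sinf/sin
    setOperationAction(ISD::FCOS, VT, Expand); // intended to end up as cosf/cos
  }
}
```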
> >>>>>>>>
> >>>>>>>> I think if we embed knowledge about the nv_XXX functions we can even get away without the cons you listed for early linking above.
> >>>>>>>
> >>>>>>> WDYM by `embed knowledge about the nv_XXX functions`? By linking those functions in? Or do you mean that we should just declare them before/instead of linking libdevice in?
> >>>>>>
> >>>>>> I mean by providing the "libcall lowering" pass. So the knowledge that llvm.cos maps to __nv_cos.
> >>>>>>
> >>>>>>>> For early link I'm assuming an order similar to [0] but I also discuss the case where we don't link libdevice early for a TU.
> >>>>>>>
> >>>>>>> That link just describes the steps needed to use libdevice. It does not deal with how/where it fits in the LLVM pipeline.
> >>>>>>> The gist is that NVVMReflect replaces some conditionals with constants. libdevice uses that as a poor man's IR preprocessor, conditionally enabling different implementations and relying on DCE and constant folding to remove the unused parts and eliminate the now useless branches.
> >>>>>>> While running NVVMReflect alone will make libdevice code valid and usable, it would still benefit from further optimizations. I do not know to what degree, though.
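For illustration, the pattern behind that "poor man's IR preprocessor": libdevice-style code branches on `__nvvm_reflect(...)`, the NVVMReflect pass folds the call to a constant (for example for "__CUDA_FTZ"), and DCE then removes the dead branch. The example function and its two callees are made up.

```
extern int __nvvm_reflect(const char *);
double __example_ftz(double);     /* hypothetical flush-to-zero variant */
double __example_precise(double); /* hypothetical precise variant */

double __example(double x) {
  if (__nvvm_reflect("__CUDA_FTZ")) /* folded to 0 or 1 by NVVMReflect */
    return __example_ftz(x);
  return __example_precise(x);
}
```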
> >>>>>>>> Link early:
> >>>>>>>> 1) clang emits module.bc and links in libdevice.bc but with the `optnone`, `noinline`, and "used" attribute for functions in libdevice. ("used" is not an attribute but could as well be.) At this stage module.bc might call __nv_XXX or llvm.XXX freely as defined by -ffast-math and friends.
> >>>>>>>
> >>>>>>> That could work. Just carrying extra IR around would probably be OK.
> >>>>>>> We may want to do NVVMReflect as soon as we have it linked in and, maybe, allow optimizing the functions that are explicitly used already.
> >>>>>>
> >>>>>> Right. NVVMReflect can be run twice, and with `alwaysinline` on the call sites of __nv_XXX functions we will actually inline and optimize them while the definitions are just "dragged along" in case we need them later.
> >>>>>>
> >>>>>>>> 2) Run some optimizations in the middle end, maybe till the end of the inliner loop, unsure.
> >>>>>>>> 3) Run a libcall lowering pass and another NVVMReflect pass (or the only instance thereof). We effectively remove all llvm.XXX calls in favor of __nv_XXX now. Note that we haven't spent (much) time on the libdevice code as it is optnone and most passes are good at skipping those. To me, it's unclear if the used parts should not be optimized before we inline them anyway to avoid redoing the optimizations over and over (per call site). That needs measuring I guess. Also note that we can still retain the current behavior for direct calls to __nv_XXX if we mark the call sites as `alwaysinline`, or at least the behavior is almost like the current one.
> >>>>>>>> 4) Run an always inliner pass on the __nv_XXX calls because it is something we would do right now. Alternatively, remove `optnone` and `noinline` from the __nv_XXX calls.
> >>>>>>>> 5) Continue with the pipeline as before.
> >>>>>>>
> >>>>>>> SGTM.
> >>>>>>>
> >>>>>>>> As mentioned above, `optnone` avoids spending time on the libdevice code until we "activate" it. At that point (globals) DCE can be scheduled to remove all unused parts right away. I don't think this is (much) more expensive than linking libdevice early right now.
> >>>>>>>>
> >>>>>>>> Link late, aka. translation units without libdevice:
> >>>>>>>> 1) clang emits module.bc but does not link in libdevice.bc, it will be made available later. We still can mix __nv_XXX and llvm.XXX calls freely as above.
> >>>>>>>> 2) Same as above.
> >>>>>>>> 3) Same as above.
> >>>>>>>> 4) Same as above but effectively a no-op, no __nv_XXX definitions are available.
> >>>>>>>> 5) Same as above.
> >>>>>>>>
> >>>>>>>> I might misunderstand something about the current pipeline but from [0] and the experiments I ran locally it looks like the above should cover all the cases. WDYT?
> >>>>>>>
> >>>>>>> The `optnone` trick may indeed remove much of the practical differences between the early/late approaches.
> >>>>>>> In principle it should work.
> >>>>>>>
> >>>>>>> Next question is -- is libdevice sufficient to satisfy LLVM's assumptions about the standard library?
> >>>>>>> While it does provide most of the equivalents of libm functions, the set is not complete and some of the functions differ from their libm counterparts. The differences are minor, so we should be able to deal with it by generating a few wrapper functions for the odd cases.
> >>>>>>> Here's what clang does to provide math functions using libdevice:
> >>>>>>> https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h
> >>>>>>
> >>>>>> Right now, clang will generate any llvm intrinsic and we crash, so anything else is probably a step in the right direction. Eventually, we should "lower" all intrinsics that the NVPTX backend can't handle or at least emit a nice error message. Preferably, clang would know what we can't deal with and not generate intrinsic calls for those in the first place.
> >>>>>>
> >>>>>>> The most concerning aspect of libdevice is that we don't know when we'll no longer be able to use the libdevice bitcode. My understanding is that IR does not guarantee binary stability and at some point we may just be unable to use it. Ideally we need our own libm for GPUs.
> >>>>>>
> >>>>>> For OpenMP I did my best to avoid writing libm (code) for GPUs by piggybacking on the CUDA and libc++ implementations, and I hope it will stay that way.
> >>>>>> That said, if the need arises we might really have to port libc++ to the GPUs.
> >>>>>>
> >>>>>> Back to the problem with libdevice. I agree that the solution of NVIDIA to ship a .bc library is suboptimal, but with the existing, or an extended, auto-upgrader we might be able to make that work reasonably well for the foreseeable future. That problem is orthogonal to what we are discussing above, I think.
> >>>>>>
> >>>>>> ~ Johannes
> >>>>>>
> >>>>>>> --Artem
> >>>>>>>
> >>>>>>>> ~ Johannes
> >>>>>>>>
> >>>>>>>> P.S. If the rewrite capability (aka libcall lowering) is generic we could use the scheme for many other things as well.
> >>>>>>>>
> >>>>>>>> [0] https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice
> >>>>>>>>
> >>>>>>>>> --Artem
> >>>>>>>>>
> >>>>>>>>>> ~ Johannes
> >>>>>>>>>>
> >>>>>>>>>>> One thing that may work within the existing compilation model is to pre-compile the standard library into PTX and then textually embed relevant functions into the generated PTX, thus pushing the 'linking' phase past the end of LLVM's compilation and making it look closer to the standard compile/link process. This way we'd only enable libcall lowering in NVPTX, assuming that the library functions will be magically available out there. Injection of PTX could be done with an external script outside of LLVM and it could be incorporated into the clang driver. Bonus points for the fact that this scheme is compatible with -fgpu-rdc out of the box -- assemble the PTX with `ptxas -rdc` and then actually link with the library, instead of injecting its PTX before invoking ptxas.
> >>>>>>>>>>>
> >>>>>>>>>>> --Artem
> >>>>>>>>>>>
> >>>>>>>>>>>> Trying to figure out a good way to have the cake and eat it too.
> >>>>>>>>>>>>
> >>>>>>>>>>>> ~ Johannes
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 3/10/21 2:49 PM, William Moses wrote:
> >>>>>>>>>>>>> Since clang (and arguably any other frontend that uses it) should link in libdevice, could we lower these intrinsics to the libdevice code?
> >>>>>>>>>>>
> >>>>>>>>>>> The linking happens *before* LLVM gets to work on the IR.
> >>>>>>>>>>> As I said, it's a workaround, not the solution. It's possible for LLVM to still attempt lowering something in the IR into a libcall and we would not be able to deal with that. It happens to work well enough in practice.
> >>>>>>>>>>> Do you have an example where you see the problem with -ffast-math?
> >>>>>>>>>>>
> >>>>>>>>>>>>> For example, consider compiling the simple device function below:
> >>>>>>>>>>>>> ```
> >>>>>>>>>>>>> // /mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S -emit-llvm --cuda-path=/usr/local/cuda-11.0 -L/usr/local/cuda-11.0/lib64 --cuda-gpu-arch=sm_37
> >>>>>>>>>>>>> __device__ double f(double x) {
> >>>>>>>>>>>>>   return cos(x);
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>> ```
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The LLVM module for it is as follows:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> ```
> >>>>>>>>>>>>> ...
> >>>>>>>>>>>>> define dso_local double @_Z1fd(double %x) #0 {
> >>>>>>>>>>>>> entry:
> >>>>>>>>>>>>>   %__a.addr.i = alloca double, align 8
> >>>>>>>>>>>>>   %x.addr = alloca double, align 8
> >>>>>>>>>>>>>   store double %x, double* %x.addr, align 8
> >>>>>>>>>>>>>   %0 = load double, double* %x.addr, align 8
> >>>>>>>>>>>>>   store double %0, double* %__a.addr.i, align 8
> >>>>>>>>>>>>>   %1 = load double, double* %__a.addr.i, align 8
> >>>>>>>>>>>>>   %call.i = call contract double @__nv_cos(double %1) #7
> >>>>>>>>>>>>>   ret double %call.i
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> define internal double @__nv_cos(double %a) #1 {
> >>>>>>>>>>>>>   %q.i = alloca i32, align 4
> >>>>>>>>>>>>> ```
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Obviously we would need to do something to ensure these functions don't get deleted prior to their use in lowering from intrinsic to libdevice.
> >>>>>>>>>>>>> ...
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <tra at google.com> wrote:
> >>>>>>>>>>>>>> On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Artem, Justin,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I am running into a problem and I'm curious if I'm missing something or if the support is simply missing.
> >>>>>>>>>>>>>>> Am I correct to assume the NVPTX backend does not deal with `llvm.sin` and friends?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Correct. It can't deal with anything that may need to lower to a standard library call.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This is what I see, with some variations: https://godbolt.org/z/PxsEWs
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If this is missing in the backend, is there a plan to get this working? I'd really like to have the intrinsics in the middle end rather than __nv_cos, not to mention that -ffast-math does emit intrinsics and crashes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It all boils down to the fact that PTX does not have the standard libc/libm which LLVM could lower the calls to, nor does it have a 'linking' phase where we could link such a library in, if we had it.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Libdevice bitcode does provide the implementations for some of the functions (though with a __nv_ prefix) and clang links it in, in order to avoid generating IR that LLVM can't handle, but that's a workaround that does not help LLVM itself.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --Artem
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> ~ Johannes
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>> ───────────────────
> >>>>>>>>>>>>>>> ∽ Johannes (he/his)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> --Artem Belevich

-- 
--Artem Belevich
James Y Knight via llvm-dev
2021-Mar-12 22:38 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
On Fri, Mar 12, 2021 at 1:51 PM Artem Belevich via llvm-dev <llvm-dev at lists.llvm.org> wrote:

>> Also, bitcode is platform specific. I can imagine building a bitcode file during the build, but shipping one means you have to know the ABI and datalayout, or hope they are the same everywhere.
>
> Agreed. We will likely need multiple variants. We will compile specifically for NVPTX or AMDGPU, and we will know the specific ABI and data layout for them regardless of the host we're building on.
>
> It appears to me that the difference vs. what we have now is that we'd need to have the libm sources somewhere, a process to build them for particular GPUs (which may need to be done out of the tree, as it may need the CUDA/HIP SDKs), and a way to incorporate such libraries into the LLVM distribution.
>
> OK. I'll agree that that may be a bit too much for now.

It sounded before like you were saying the library should effectively be function aliases for standard libm names, to call __nv_ names. Isn't it utterly trivial to generate such a bitcode file as part of the toolchain build, without requiring any external SDKs?
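A minimal sketch of the kind of bitcode file that question has in mind -- essentially the wrapper file sketched earlier in the thread, reduced to pure forwarders. It needs only `__nv_*` declarations (no CUDA SDK headers), so it could plausibly be compiled to bitcode during the toolchain build; the file name and build command are assumptions.

```
/* libm-nvptx.c (hypothetical), built at toolchain build time with something like:
     clang --target=nvptx64-nvidia-cuda -O1 -fno-builtin -emit-llvm -c \
           libm-nvptx.c -o libm-nvptx.bc
   The __nv_* definitions are only needed later, when libdevice is linked in. */
double __nv_cos(double);
float  __nv_cosf(float);

double cos(double x) { return __nv_cos(x); }
float  cosf(float x) { return __nv_cosf(x); }
```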