thr3ads.net - llvm dev - [llvm-dev] NVPTX codegen for llvm.sin (and friends) [Mar 2021]

If this information is useful, please help other people find it:
Share via:

Johannes Doerfert via llvm-dev

2021-Mar-10 20:57 UTC

[llvm-dev] NVPTX codegen for llvm.sin (and friends)

Right. We could keep the definition of __nv_cos and friends
around. Right now, -ffast-math might just crash on the user,
which is arguably a bad thing. I can also see us benefiting
in various other ways from llvm.cos uses instead of __nv_cos
(assuming precision is according to the user requirements but
that is always a condition).

It could be as simple as introducing __nv_cos into
"llvm.used" and a backend matching/rewrite pass.

If the backend knew the libdevice location it could even pick
the definitions from there. Maybe we could link libdevice late
instead of eager?

Trying to figure out a good way to have the cake and eat it too.

~ Johannes


On 3/10/21 2:49 PM, William Moses wrote:> Since clang (and arguably any other frontend that uses) should link in
> libdevice, could we lower these intrinsics to the libdevice code?
>
> For example, consider compiling the simple device function below:
>
> ```
> // /mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S -emit-llvm
>   --cuda-path=/usr/local/cuda-11.0 -L/usr/local/cuda-11.0/lib64
> --cuda-gpu-arch=sm_37
> __device__ double f(double x) {
>      return cos(x);
> }
> ```
>
> The LLVM module for it is as follows:
>
> ```
> ...
> define dso_local double @_Z1fd(double %x) #0 {
> entry:
>    %__a.addr.i = alloca double, align 8
>    %x.addr = alloca double, align 8
>    store double %x, double* %x.addr, align 8
>    %0 = load double, double* %x.addr, align 8
>    store double %0, double* %__a.addr.i, align 8
>    %1 = load double, double* %__a.addr.i, align 8
>    %call.i = call contract double @__nv_cos(double %1) #7
>    ret double %call.i
> }
>
> define internal double @__nv_cos(double %a) #1 {
>    %q.i = alloca i32, align 4
> ```
>
> Obviously we would need to do something to ensure these functions don't
get
> deleted prior to their use in lowering from intrinsic to libdevice.
> ...
>
>
> On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <tra at google.com>
wrote:
>
>> On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert <
>> johannesdoerfert at gmail.com> wrote:
>>
>>> Artem, Justin,
>>>
>>> I am running into a problem and I'm curious if I'm missing
something or
>>> if the support is simply missing.
>>> Am I correct to assume the NVPTX backend does not deal with
`llvm.sin`
>>> and friends?
>>>
>> Correct. It can't deal with anything that may need to lower to a
standard
>> library call.
>>
>>> This is what I see, with some variations:
https://godbolt.org/z/PxsEWs
>>>
>>> If this is missing in the backend, is there a plan to get this
working,
>>> I'd really like to have the
>>> intrinsics in the middle end rather than __nv_cos, not to mention
that
>>> -ffast-math does emit intrinsics
>>> and crashes.
>>>
>> It all boils down to the fact that PTX does not have the standard
>> libc/libm which LLVM could lower the calls to, nor does it have a
'linking'
>> phase where we could link such a library in, if we had it.
>>
>> Libdevice bitcode does provide the implementations for some of the
>> functions (though with a __nv_ prefix) and clang links it in in order
to
>> avoid generating IR that LLVM can't handle, but that's a
workaround that
>> does not help LLVM itself.
>>
>> --Artem
>>
>>
>>
>>> ~ Johannes
>>>
>>>
>>> --
>>> ───────────────────
>>> ∽ Johannes (he/his)
>>>
>>>
>> --
>> --Artem Belevich
>>

Artem Belevich via llvm-dev

2021-Mar-10 21:25 UTC

head link

[llvm-dev] NVPTX codegen for llvm.sin (and friends)

On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <
johannesdoerfert at gmail.com> wrote:
> Right. We could keep the definition of __nv_cos and friends
> around. Right now, -ffast-math might just crash on the user,
> which is arguably a bad thing. I can also see us benefiting
> in various other ways from llvm.cos uses instead of __nv_cos
> (assuming precision is according to the user requirements but
> that is always a condition).
>
> It could be as simple as introducing __nv_cos into
> "llvm.used" and a backend matching/rewrite pass.
>
> If the backend knew the libdevice location it could even pick
> the definitions from there. Maybe we could link libdevice late
> instead of eager?
>
It's possible, but it would require plumbing in CUDA SDK awareness into
LLVM. While clang driver can deal with that, LLVM currently can't. The
bitcode library path would have to be provided by the user.

The standard library as bitcode raises some questions.
* When do we want to do the linking? If we do it at the beginning, then the
question is how to make sure unused functions are not eliminated before we
may need them, as we don't know apriori what's going to be needed. We
also
do want the unused functions to be gone after we're done. Linking it in
early would allow optimizing the code better at the expense of having to
optimize a lot of code we'll throw away. Linking it in late has less
overhead, but leaves the linked in bitcode unoptimized, though it's
probably in the ballpark of what would happen with a real library call.
I.e. no inlining, etc.

* It incorporates linking into LLVM, which is not LLVM's job. Arguably, the
line should be drawn at the lowering to libcalls as it's done for other
back-ends. However, we're also constrained to by the need to have the
linking done before we generate PTX which prevents doing it after LLVM is
done generating an object file.

One thing that may work within the existing compilation model is to
pre-compile the standard library into PTX and then textually embed relevant
functions into the generated PTX, thus pushing the 'linking' phase past
the
end of LLVM's compilation and make it look closer to the standard
compile/link process. This way we'd only enable libcall lowering in NVPTX,
assuming that the library functions will be magically available out there.
Injection of PTX could be done with an external script outside of LLVM and
it could be incorporated into clang driver. Bonus points for the fact that
this scheme is compatible with -fgpu-rdc out of the box -- assemble the PTX
with `ptxas -rdc` and then actually link with the library, instead of
injecting its PTX before invoking ptxas.

--Artem

Trying to figure out a good way to have the cake and eat it
too.>
> ~ Johannes
>
>
> On 3/10/21 2:49 PM, William Moses wrote:
> > Since clang (and arguably any other frontend that uses) should link in
> > libdevice, could we lower these intrinsics to the libdevice code?
>
The linking happens *before* LLVM gets to work on IR.
As I said, it's a workaround, not the solution. It's possible for LLVM
to
still attempt lowering something in the IR into a libcall and we would not
be able to deal with that. It happens to work well enough in practice.

Do you have an example where you see the problem with -ffast-math?


> >
> > For example, consider compiling the simple device function below:
> >
> > ```
> > // /mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S -emit-llvm
> >   --cuda-path=/usr/local/cuda-11.0 -L/usr/local/cuda-11.0/lib64
> > --cuda-gpu-arch=sm_37
> > __device__ double f(double x) {
> >      return cos(x);
> > }
> > ```
> >
> > The LLVM module for it is as follows:
> >
> > ```
> > ...
> > define dso_local double @_Z1fd(double %x) #0 {
> > entry:
> >    %__a.addr.i = alloca double, align 8
> >    %x.addr = alloca double, align 8
> >    store double %x, double* %x.addr, align 8
> >    %0 = load double, double* %x.addr, align 8
> >    store double %0, double* %__a.addr.i, align 8
> >    %1 = load double, double* %__a.addr.i, align 8
> >    %call.i = call contract double @__nv_cos(double %1) #7
> >    ret double %call.i
> > }
> >
> > define internal double @__nv_cos(double %a) #1 {
> >    %q.i = alloca i32, align 4
> > ```
> >
> > Obviously we would need to do something to ensure these functions
don't
> get
> > deleted prior to their use in lowering from intrinsic to libdevice.
> > ...
> >
> >
> > On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <tra at
google.com> wrote:
> >
> >> On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert <
> >> johannesdoerfert at gmail.com> wrote:
> >>
> >>> Artem, Justin,
> >>>
> >>> I am running into a problem and I'm curious if I'm
missing something or
> >>> if the support is simply missing.
> >>> Am I correct to assume the NVPTX backend does not deal with
`llvm.sin`
> >>> and friends?
> >>>
> >> Correct. It can't deal with anything that may need to lower to
a
> standard
> >> library call.
> >>
> >>> This is what I see, with some variations:
https://godbolt.org/z/PxsEWs
> >>>
> >>> If this is missing in the backend, is there a plan to get this
working,
> >>> I'd really like to have the
> >>> intrinsics in the middle end rather than __nv_cos, not to
mention that
> >>> -ffast-math does emit intrinsics
> >>> and crashes.
> >>>
> >> It all boils down to the fact that PTX does not have the standard
> >> libc/libm which LLVM could lower the calls to, nor does it have a
> 'linking'
> >> phase where we could link such a library in, if we had it.
> >>
> >> Libdevice bitcode does provide the implementations for some of the
> >> functions (though with a __nv_ prefix) and clang links it in in
order to
> >> avoid generating IR that LLVM can't handle, but that's a
workaround that
> >> does not help LLVM itself.
> >>
> >> --Artem
> >>
> >>
> >>
> >>> ~ Johannes
> >>>
> >>>
> >>> --
> >>> ───────────────────
> >>> ∽ Johannes (he/his)
> >>>
> >>>
> >> --
> >> --Artem Belevich
> >>
>

-- 
--Artem Belevich
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210310/68a0c9e7/attachment-0001.html>

llvm dev - Mar 2021 - NVPTX codegen for llvm.sin (and friends)

[llvm-dev] NVPTX codegen for llvm.sin (and friends)

[llvm-dev] NVPTX codegen for llvm.sin (and friends)