William Moses via llvm-dev
2021-Mar-11 02:44 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
We could also consider doing something slightly broader. For example, we could attach a special attribute (or metadata) to the llvm.cos call/declaration that points to the actual __nv_cos function. A subsequent lowering pass would then replace all uses of an intrinsic carrying that attribute with the actual function.

On Wed, Mar 10, 2021 at 7:57 PM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:

> On 3/10/21 6:22 PM, Artem Belevich wrote:
>> On Wed, Mar 10, 2021 at 3:44 PM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
>>> On 3/10/21 4:38 PM, Artem Belevich wrote:
>>>> On Wed, Mar 10, 2021 at 1:55 PM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
>>>>> On 3/10/21 3:25 PM, Artem Belevich wrote:
>>>>>> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
>>>>>>> Right. We could keep the definition of __nv_cos and friends around. Right now, -ffast-math might just crash on the user, which is arguably a bad thing. I can also see us benefiting in various other ways from llvm.cos uses instead of __nv_cos (assuming precision is according to the user requirements, but that is always a condition).

>>>>>>> It could be as simple as introducing __nv_cos into "llvm.used" and a backend matching/rewrite pass.

>>>>>>> If the backend knew the libdevice location it could even pick the definitions from there. Maybe we could link libdevice late instead of eagerly?

>>>>>> It's possible, but it would require plumbing CUDA SDK awareness into LLVM. While the clang driver can deal with that, LLVM currently can't. The bitcode library path would have to be provided by the user.

>>>>> The PTX backend could arguably be CUDA SDK aware, IMHO; it would even be fine if the middle end does the remapping to get inlining and folding benefits also after __nv_cos is used. See below.

>>>>>> The standard library as bitcode raises some questions.

>>>>> Which standard library? CUDA's libdevice is a bitcode library, right?

>>>> It's whatever LLVM will need to lower libcalls to. libdevice bitcode is the closest approximation of that we have at the moment.

>>>>>> * When do we want to do the linking? If we do it at the beginning, then the question is how to make sure unused functions are not eliminated before we may need them, as we don't know a priori what's going to be needed. We also do want the unused functions to be gone after we're done. Linking it in early would allow optimizing the code better, at the expense of having to optimize a lot of code we'll throw away. Linking it in late has less overhead, but leaves the linked-in bitcode unoptimized, though it's probably in the ballpark of what would happen with a real library call, i.e. no inlining, etc.

>>>>>> * It incorporates linking into LLVM, which is not LLVM's job. Arguably, the line should be drawn at the lowering to libcalls, as it's done for other back-ends. However, we're also constrained by the need to have the linking done before we generate PTX, which prevents doing it after LLVM is done generating an object file.
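To make the "llvm.used" idea quoted above concrete, a keep-alive step could look roughly like the following C++ sketch. It assumes current LLVM APIs; nothing like this exists in tree today.

```cpp
// Minimal sketch: after libdevice.bc has been linked into the module, pin
// every __nv_* definition so global DCE cannot delete it before a late
// intrinsic-lowering step runs.
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/ModuleUtils.h"

using namespace llvm;

static void keepLibdeviceAlive(Module &M) {
  SmallVector<GlobalValue *, 16> Keep;
  for (Function &F : M)
    if (!F.isDeclaration() && F.getName().startswith("__nv_"))
      Keep.push_back(&F);
  // appendToUsed adds the functions to @llvm.used, i.e. the "artificial
  // uses" discussed in the thread.
  appendToUsed(M, Keep);
}
```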
>>>>> I'm confused. Clang links in libdevice.bc early.

>>>> Yes. Because that's where it has to happen if we want to keep LLVM unaware of the CUDA SDK. It does not have to be the case if/when LLVM can do the linking itself.

>>>>> If we make sure `__nv_cos` is not deleted early, we can at any point "lower" `llvm.cos` to `__nv_cos`, which is available. After the lowering we can remove the artificial uses of `__nv_XXX` functions that we used to keep the definitions around, in order to remove them from the final result.

>>>> This is the 'link early' approach; I should've been explicit that it's 'link early *everything*' as opposed to linking only what's needed at the beginning. It would work at the expense of having to process/optimize 500KB worth of bitcode for every compilation, whether it needs it or not.

>>>>> We get the benefit of having `llvm.cos` for some of the pipeline; we know it does not have all the bad effects, while `__nv_cos` is defined with inline assembly. We also get the benefit of inlining `__nv_cos` and folding the implementation based on the arguments. Finally, this should work with the existing pipeline: the linking is the same as before, all we do is keep the definitions alive longer and lower `llvm.cos` to `__nv_cos` in a middle-end pass.

>>>> Again, I agree that it is doable.

>>>>> This might be similar to the PTX solution you describe below, but I feel we get the inline benefit from this without actually changing the pipeline at all.

>>>> So, to summarize:

>>>> * Link the library as bitcode early, add artificial placeholders for everything, compile, remove placeholders and DCE unused stuff away.
>>>>   Pros: we're already doing most of it before clang hands off IR to LLVM, so it just pushes it a bit lower in the compilation.
>>>>   Cons: runtime cost of optimizing libdevice bitcode; libdevice may be required for all NVPTX compilations?

>>>> * Link the library as bitcode late.
>>>>   Pros: lower runtime cost than the link-early approach.
>>>>   Cons: we'll need to make sure that the NVVMReflect pass processes the library; fewer optimizations on the library functions (some of the code gets DCE'ed away after NVVMReflect and the rest could be optimized better); libdevice may be required for all NVPTX compilations?

>>>> * 'Link' with the library as PTX appended as text to LLVM's output and let ptxas do the 'linking'.
>>>>   Pros: LLVM remains agnostic of CUDA SDK installation details. All it does is allow lowering libcalls, leaving their resolution to external tools.
>>>>   Cons: need to have the PTX library somewhere and need to integrate the 'linking' into the compilation process somehow.

>>>> Neither is particularly good. If the runtime overhead of link-early is acceptable, then it may be a winner here, by a very small margin. Link-as-PTX may be better conceptually as it keeps linking and compilation separate.

>>>> As for the practical steps, here's what we need:
>>>> - allow libcall lowering in NVPTX, possibly guarded by a flag (this is needed for all of the approaches above);
>>>> - teach LLVM how to link in bitcode (and, possibly, control early/late mode);
>>>> - teach the clang driver to delegate libdevice linking to LLVM.

>>>> This will allow us to experiment with all three approaches and see what works best.
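The first practical step in Artem's list, "allow libcall lowering in NVPTX, possibly guarded by a flag", could look roughly like the sketch below. The flag name and the intrinsic-to-__nv_* map are invented for illustration; a real pass would have to cover every overload clang can emit.

```cpp
#include "llvm/ADT/StringMap.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Hypothetical flag guarding the rewrite (the name is made up).
static cl::opt<bool> NVPTXLowerLibcalls(
    "nvptx-lower-libcalls", cl::init(false),
    cl::desc("Rewrite math intrinsics to __nv_* libdevice calls"));

static bool lowerIntrinsicsToLibdevice(Module &M) {
  if (!NVPTXLowerLibcalls)
    return false;
  // Partial mapping, for illustration only.
  const StringMap<StringRef> Map = {{"llvm.cos.f64", "__nv_cos"},
                                    {"llvm.cos.f32", "__nv_cosf"},
                                    {"llvm.sin.f64", "__nv_sin"},
                                    {"llvm.sin.f32", "__nv_sinf"}};
  bool Changed = false;
  for (Function &F : M) {
    if (!F.isIntrinsic())
      continue;
    auto It = Map.find(F.getName());
    if (It == Map.end())
      continue;
    // The __nv_* definition was linked in from libdevice and kept alive; it
    // has the same signature here, so rewriting the callee is sufficient.
    FunctionCallee NV = M.getOrInsertFunction(It->second, F.getFunctionType());
    F.replaceAllUsesWith(NV.getCallee());
    Changed = true;
  }
  return Changed;
}
```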
>>> I think if we embed knowledge about the nv_XXX functions we can even get away without the cons you listed for early linking above.

>> WDYM by `embed knowledge about the nv_XXX functions`? By linking those functions in? Or do you mean that we should just declare them before/instead of linking libdevice in?

> I mean by providing the "libcall lowering" pass. So the knowledge that llvm.cos maps to __nv_cos.

>>> For early link I'm assuming an order similar to [0], but I also discuss the case where we don't link libdevice early for a TU.

>> That link just describes the steps needed to use libdevice. It does not deal with how/where it fits in the LLVM pipeline. The gist is that NVVMReflect replaces some conditionals with constants. libdevice uses that as a poor man's IR preprocessor, conditionally enabling different implementations and relying on DCE and constant folding to remove unused parts and eliminate the now useless branches. While running NVVMReflect alone will make libdevice code valid and usable, it would still benefit from further optimizations. I do not know to what degree, though.

>>> Link early:
>>> 1) clang emits module.bc and links in libdevice.bc, but with the `optnone`, `noinline`, and "used" attributes for functions in libdevice ("used" is not an attribute but could as well be). At this stage module.bc might call __nv_XXX or llvm.XXX freely, as defined by -ffast-math and friends.

>> That could work. Just carrying extra IR around would probably be OK. We may want to do NVVMReflect as soon as we have it linked in and, maybe, allow optimizing the functions that are explicitly used already.

> Right. NVVMReflect can be run twice, and with `alwaysinline` on the call sites of __nv_XXX functions we will actually inline and optimize them while the definitions are just "dragged along" in case we need them later.

>>> 2) Run some optimizations in the middle end, maybe till the end of the inliner loop, unsure.
>>> 3) Run a libcall lowering pass and another NVVMReflect pass (or the only instance thereof). We effectively remove all llvm.XXX calls in favor of __nv_XXX now. Note that we haven't spent (much) time on the libdevice code, as it is optnone and most passes are good at skipping those. To me, it's unclear whether the used parts should not be optimized before we inline them anyway, to avoid redoing the optimizations over and over (per call site). That needs measuring, I guess. Also note that we can still retain the current behavior for direct calls to __nv_XXX if we mark the call sites as `alwaysinline`, or at least the behavior is almost like the current one.
>>> 4) Run an always-inliner pass on the __nv_XXX calls, because it is something we would do right now. Alternatively, remove `optnone` and `noinline` from the __nv_XXX calls.
>>> 5) Continue with the pipeline as before.

>> SGTM.

>>> As mentioned above, `optnone` avoids spending time on the libdevice code until we "activate" it. At that point (globals) DCE can be scheduled to remove all unused parts right away. I don't think this is (much) more expensive than linking libdevice early right now.
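A sketch of what steps 1 and 4 of the link-early scheme could boil down to, assuming the frontend (or an early module pass) "parks" the freshly linked libdevice definitions and a later pass activates them. The helper names are made up; the attribute choices follow the message above.

```cpp
#include "llvm/IR/InstrTypes.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Step 1: keep the libdevice bodies around, but unoptimized and not inlined,
// so they are cheap to "drag along" through the middle end.
static void parkLibdeviceDefinitions(Module &M) {
  for (Function &F : M)
    if (!F.isDeclaration() && F.getName().startswith("__nv_")) {
      F.addFnAttr(Attribute::NoInline); // optnone requires noinline
      F.addFnAttr(Attribute::OptimizeNone);
    }
}

// Steps 3/4: once llvm.* calls have been rewritten to __nv_*, drop the
// parking attributes and ask the inliner to pull the used bodies in.
static void activateLibdeviceDefinitions(Module &M) {
  for (Function &F : M)
    if (F.getName().startswith("__nv_")) {
      F.removeFnAttr(Attribute::OptimizeNone);
      F.removeFnAttr(Attribute::NoInline);
      for (User *U : F.users())
        if (auto *CB = dyn_cast<CallBase>(U))
          CB->addFnAttr(Attribute::AlwaysInline);
    }
}
```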
>>> Link late, i.e. translation units without libdevice:
>>> 1) clang emits module.bc but does not link in libdevice.bc; it will be made available later. We can still mix __nv_XXX and llvm.XXX calls freely, as above.
>>> 2) Same as above.
>>> 3) Same as above.
>>> 4) Same as above, but effectively a no-op; no __nv_XXX definitions are available.
>>> 5) Same as above.

>>> I might misunderstand something about the current pipeline, but from [0] and the experiments I ran locally it looks like the above should cover all the cases. WDYT?

>> The `optnone` trick may indeed remove much of the practical differences between the early/late approaches. In principle it should work.

>> Next question is -- is libdevice sufficient to satisfy LLVM's assumptions about the standard library? While it does provide most of the equivalents of libm functions, the set is not complete and some of the functions differ from their libm counterparts. The differences are minor, so we should be able to deal with that by generating a few wrapper functions for the odd cases. Here's what clang does to provide math functions using libdevice: https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h

> Right now, clang will generate any llvm intrinsic and we crash, so anything else is probably a step in the right direction. Eventually, we should "lower" all intrinsics that the NVPTX backend can't handle, or at least emit a nice error message. Preferably, clang would know what we can't deal with and not generate intrinsic calls for those in the first place.

>> The most concerning aspect of libdevice is that we don't know how long we'll be able to keep using the libdevice bitcode. My understanding is that IR does not guarantee binary stability, and at some point we may just be unable to use it. Ideally we need our own libm for GPUs.

> For OpenMP I did my best to avoid writing libm (code) for GPUs by piggybacking on CUDA and libc++ implementations; I hope it will stay that way. That said, if the need arises we might really have to port libc++ to the GPUs.

> Back to the problem with libdevice. I agree that NVIDIA's solution of shipping a .bc library is suboptimal, but with the existing, or an extended, auto-upgrader we might be able to make that work reasonably well for the foreseeable future. That problem is orthogonal to what we are discussing above, I think.

> ~ Johannes

>> --Artem

>>> ~ Johannes

>>> P.S. If the rewrite capability (aka libcall lowering) is generic, we could use the scheme for many other things as well.

>>> [0] https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice

>>>> --Artem

>>>>> ~ Johannes

>>>>>> One thing that may work within the existing compilation model is to pre-compile the standard library into PTX and then textually embed the relevant functions into the generated PTX, thus pushing the 'linking' phase past the end of LLVM's compilation and making it look closer to the standard compile/link process. This way we'd only enable libcall lowering in NVPTX, assuming that the library functions will be magically available out there. Injection of PTX could be done with an external script outside of LLVM, and it could be incorporated into the clang driver. Bonus points for the fact that this scheme is compatible with -fgpu-rdc out of the box -- assemble the PTX with `ptxas -rdc` and then actually link with the library, instead of injecting its PTX before invoking ptxas.
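Independent of which linking scheme wins, Johannes notes above that intrinsics the backend cannot handle should "at least emit a nice error message" rather than crash. A lowering pass could report such calls through the regular diagnostic machinery; the following is only a sketch of that idea.

```cpp
#include "llvm/IR/DiagnosticInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include <string>

using namespace llvm;

// Report a direct intrinsic call that has no NVPTX lowering and no known
// __nv_* replacement, instead of letting instruction selection crash.
static void reportUnsupportedIntrinsic(const CallInst &CI) {
  const Function &Caller = *CI.getFunction();
  // Intrinsic calls are always direct, so getCalledFunction() is non-null.
  std::string Msg = ("intrinsic '" + CI.getCalledFunction()->getName() +
                     "' is not supported by the NVPTX backend and has no "
                     "libdevice equivalent")
                        .str();
  Caller.getContext().diagnose(
      DiagnosticInfoUnsupported(Caller, Msg, CI.getDebugLoc()));
}
```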
>>>>>> --Artem

>>>>>>> Trying to figure out a good way to have the cake and eat it too.

>>>>>>> ~ Johannes

>>>>>>> On 3/10/21 2:49 PM, William Moses wrote:
>>>>>>>> Since clang (and arguably any other frontend that uses it) should link in libdevice, could we lower these intrinsics to the libdevice code?

>>>>>> The linking happens *before* LLVM gets to work on IR. As I said, it's a workaround, not the solution. It's possible for LLVM to still attempt lowering something in the IR into a libcall, and we would not be able to deal with that. It happens to work well enough in practice.

>>>>>> Do you have an example where you see the problem with -ffast-math?

>>>>>>>> For example, consider compiling the simple device function below:
>>>>>>>>
>>>>>>>> ```
>>>>>>>> // /mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S -emit-llvm --cuda-path=/usr/local/cuda-11.0 -L/usr/local/cuda-11.0/lib64 --cuda-gpu-arch=sm_37
>>>>>>>> __device__ double f(double x) {
>>>>>>>>   return cos(x);
>>>>>>>> }
>>>>>>>> ```
>>>>>>>>
>>>>>>>> The LLVM module for it is as follows:
>>>>>>>>
>>>>>>>> ```
>>>>>>>> ...
>>>>>>>> define dso_local double @_Z1fd(double %x) #0 {
>>>>>>>> entry:
>>>>>>>>   %__a.addr.i = alloca double, align 8
>>>>>>>>   %x.addr = alloca double, align 8
>>>>>>>>   store double %x, double* %x.addr, align 8
>>>>>>>>   %0 = load double, double* %x.addr, align 8
>>>>>>>>   store double %0, double* %__a.addr.i, align 8
>>>>>>>>   %1 = load double, double* %__a.addr.i, align 8
>>>>>>>>   %call.i = call contract double @__nv_cos(double %1) #7
>>>>>>>>   ret double %call.i
>>>>>>>> }
>>>>>>>>
>>>>>>>> define internal double @__nv_cos(double %a) #1 {
>>>>>>>>   %q.i = alloca i32, align 4
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Obviously we would need to do something to ensure these functions don't get deleted prior to their use in lowering from intrinsic to libdevice.
>>>>>>>> ...

>>>>>>>> On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <tra at google.com> wrote:
>>>>>>>>> On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
>>>>>>>>>> Artem, Justin,
>>>>>>>>>>
>>>>>>>>>> I am running into a problem and I'm curious if I'm missing something or if the support is simply missing. Am I correct to assume the NVPTX backend does not deal with `llvm.sin` and friends?

>>>>>>>>> Correct. It can't deal with anything that may need to lower to a standard library call.

>>>>>>>>>> This is what I see, with some variations: https://godbolt.org/z/PxsEWs
>>>>>>>>>>
>>>>>>>>>> If this is missing in the backend, is there a plan to get this working? I'd really like to have the intrinsics in the middle end rather than __nv_cos, not to mention that -ffast-math does emit intrinsics and crashes.
>>>>>>>>> It all boils down to the fact that PTX does not have the standard libc/libm which LLVM could lower the calls to, nor does it have a 'linking' phase where we could link such a library in, if we had it.

>>>>>>>>> Libdevice bitcode does provide the implementations for some of the functions (though with a __nv_ prefix), and clang links it in in order to avoid generating IR that LLVM can't handle, but that's a workaround that does not help LLVM itself.

>>>>>>>>> --Artem

>>>>>>>>>> ~ Johannes
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> ───────────────────
>>>>>>>>>> ∽ Johannes (he/his)

>>>>>>>>> --
>>>>>>>>> --Artem Belevich
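Putting the pieces of this message together: after the lowering step has run, the artificial uses can be dropped again and whatever was never needed goes away. A rough sketch of that cleanup follows; it is deliberately blunt in that it assumes the keep-alive pass owns all of @llvm.used, whereas a real implementation would only remove the entries it added.

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Module.h"

using namespace llvm;

static void dropUnusedLibdevice(Module &M) {
  // Drop the keep-alive list added earlier. (Blunt: erases the whole
  // @llvm.used variable instead of filtering individual entries.)
  if (GlobalVariable *Used = M.getGlobalVariable("llvm.used"))
    Used->eraseFromParent();

  // Erase every libdevice function that ended up unused after lowering and
  // inlining; collect first so the module iterator is not invalidated.
  SmallVector<Function *, 16> Dead;
  for (Function &F : M)
    if (F.getName().startswith("__nv_") && F.use_empty())
      Dead.push_back(&F);
  for (Function *F : Dead)
    F->eraseFromParent();
}
```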
Johannes Doerfert via llvm-dev
2021-Mar-11 18:54 UTC
[llvm-dev] NVPTX codegen for llvm.sin (and friends)
I certainly agree we should try to avoid a hard-coded mapping in C++.

I could see something like:

```
__attribute__((implementation("llvm.cos")))
double __nv_cos(...) { ... }
```

and a pass that transforms all calls to a function with an "implementation" attribute into calls to that implementation. Maybe later we attach a score/priority ;)

On 3/10/21 8:44 PM, William Moses wrote:
> We could also consider doing something slightly broader. For example, we could attach a special attribute (or metadata) to the llvm.cos call/declaration that points to the actual __nv_cos function. A subsequent lowering pass would then replace all uses of an intrinsic carrying that attribute with the actual function.
>
> [...]
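A sketch of the pass Johannes describes, assuming the source-level attribute is lowered to an IR string attribute spelled "implementation" (both the spelling and the lowering are hypothetical), so that e.g. __nv_cos carries implementation("llvm.cos") and all calls to the named intrinsic get redirected to it:

```cpp
#include "llvm/IR/Module.h"

using namespace llvm;

static bool redirectToImplementations(Module &M) {
  bool Changed = false;
  for (Function &Impl : M) {
    if (!Impl.hasFnAttribute("implementation"))
      continue;
    // E.g. __nv_cos would carry implementation("llvm.cos"); note that the
    // overloaded intrinsic is actually named llvm.cos.f64 in IR, so a real
    // pass would have to handle the type-mangled names as well.
    StringRef Target =
        Impl.getFnAttribute("implementation").getValueAsString();
    if (Function *IntrinsicFn = M.getFunction(Target)) {
      // Assumes matching signatures; a real pass would verify that, and
      // could also honor the score/priority mentioned above.
      IntrinsicFn->replaceAllUsesWith(&Impl);
      Changed = true;
    }
  }
  return Changed;
}
```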