Yuanfeng Peng via llvm-dev
2016-Mar-13 23:13 UTC
[llvm-dev] instrumenting device code with gpucc
Hey Jingyue,

Thanks for being so responsive! I finally figured out a way to resolve the issue: all I had to do was pass `-only-needed` when merging the device bitcodes with llvm-link.

However, since we actually need to instrument the host code as well, I ran into another issue when I tried to glue the instrumented host code and the fatbin together. When I instrumented only the device code, I used the following command:

"/mnt/wtf/tools/bin/clang-3.9" "-cc1" "-triple" "x86_64-unknown-linux-gnu"
"-aux-triple" "nvptx64-nvidia-cuda" "-fcuda-target-overloads"
"-fcuda-disable-target-call-checks" "-emit-obj" "-disable-free"
"-main-file-name" "axpy.cu" "-mrelocation-model" "static"
"-mthread-model" "posix" "-fmath-errno" "-masm-verbose"
"-mconstructor-aliases" "-munwind-tables" "-fuse-init-array"
"-target-cpu" "x86-64" "-momit-leaf-frame-pointer" "-dwarf-column-info"
"-debugger-tuning=gdb" "-resource-dir" "/mnt/wtf/tools/bin/../lib/clang/3.9.0"
"-I" "/usr/local/cuda-7.0/samples/common/inc"
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8"
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/x86_64-linux-gnu/c++/4.8"
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/x86_64-linux-gnu/c++/4.8"
"-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/backward"
"-internal-isystem" "/usr/local/include"
"-internal-isystem" "/mnt/wtf/tools/bin/../lib/clang/3.9.0/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem" "/usr/include"
"-internal-isystem" "/usr/local/cuda/include"
"-include" "__clang_cuda_runtime_wrapper.h" "-O3" "-fdeprecated-macro"
"-fdebug-compilation-dir" "/mnt/wtf/workspace/cuda/gpu-race-detection"
"-ferror-limit" "19" "-fmessage-length" "291" "-pthread" "-fobjc-runtime=gcc"
"-fcxx-exceptions" "-fexceptions" "-fdiagnostics-show-option"
"-vectorize-loops" "-vectorize-slp" "-o" "axpy-host.o" "-x" "cuda" "tests/axpy.cu"
"-fcuda-include-gpubinary" "axpy-sm_30.fatbin"

which, from my understanding, compiles the host code in tests/axpy.cu and links it with axpy-sm_30.fatbin. However, now that I have instrumented the IR of the host code (axpy.bc) and run `llc axpy.bc -o axpy.s`, which command should I use to link axpy.s with axpy-sm_30.fatbin? I tried -cc1as, but the flag '-fcuda-include-gpubinary' was not recognized.

Thanks!
yuanfeng

On Sat, Mar 12, 2016 at 12:05 AM, Jingyue Wu <jingyue at google.com> wrote:

> I've no idea. Without instrumentation, nvvm_reflect_anchor doesn't appear
> in the final PTX, right? If that's the case, some pass in llc must have
> deleted the anchor and you should be able to figure out which one.
>
> On Fri, Mar 11, 2016 at 4:56 PM, Yuanfeng Peng <yuanfeng.jack.peng at gmail.com> wrote:
>
>> Hey Jingyue,
>>
>> Though I tried `opt -nvvm-reflect` on both bc files, the nvvm reflect
>> anchor didn't go away; ptxas is still complaining about the duplicate
>> definition of function '_ZL21__nvvm_reflect_anchorv'. Did I misuse the
>> nvvm-reflect pass?
>>
>> Thanks!
>> yuanfeng
>>
>> On Fri, Mar 11, 2016 at 10:10 AM, Jingyue Wu <jingyue at google.com> wrote:
>>
>>> According to the examples you sent, I believe the linking issue was
>>> caused by nvvm reflection anchors. I haven't played with that, but I
>>> guess running nvvm-reflect on an IR module removes the anchors. After
>>> that, you can llvm-link the two bc/ll files.
>>>
>>> Another potential issue is that your cuda_hooks-sm_30.ll is unoptimized.
>>> This could cause the instrumented code to run super slow.
>>>
>>> On Fri, Mar 11, 2016 at 9:40 AM, Yuanfeng Peng <yuanfeng.jack.peng at gmail.com> wrote:
>>>
>>>> Hey Jingyue,
>>>>
>>>> Attached are the .ll files. Thanks!
>>>>
>>>> yuanfeng
>>>>
>>>> On Fri, Mar 11, 2016 at 3:47 AM, Jingyue Wu <jingyue at google.com> wrote:
>>>>
>>>>> Looks like we are getting closer!
>>>>> On Thu, Mar 10, 2016 at 5:21 PM, Yuanfeng Peng <yuanfeng.jack.peng at gmail.com> wrote:
>>>>>
>>>>>> Hi Jingyue,
>>>>>>
>>>>>> Thank you so much for the helpful response! I didn't know that PTX
>>>>>> assembly cannot be linked; that's likely the reason for my issue.
>>>>>>
>>>>>> So I did the following as you suggested (axpy-sm_30.bc is the
>>>>>> instrumented bitcode, and cuda_hooks-sm_30.bc contains the hook
>>>>>> functions):
>>>>>>
>>>>>> llvm-link axpy-sm_30.bc cuda_hooks-sm_30.bc -o inst_axpy-sm_30.bc
>>>>>> llc inst_axpy-sm_30.bc -o axpy-sm_30.s
>>>>>> "/usr/local/cuda/bin/ptxas" "-m64" "-O3" -c "--gpu-name" "sm_30" "--output-file" axpy-sm_30.o axpy-sm_30.s
>>>>>>
>>>>>> However, I got the following error from ptxas:
>>>>>>
>>>>>> ptxas axpy-sm_30.s, line 106; error : Duplicate definition of function '_ZL21__nvvm_reflect_anchorv'
>>>>>> ptxas axpy-sm_30.s, line 106; fatal : Parsing error near '.2': syntax error
>>>>>> ptxas fatal : Ptx assembly aborted due to errors
>>>>>>
>>>>>> Looks like some CUDA function definitions are in both bitcode files,
>>>>>> which caused the duplicate definition... what am I supposed to do to
>>>>>> resolve this issue?
>>>>>>
>>>>> Can you attach axpy-sm_30.ll and cuda_hooks-sm_30.ll? The duplication
>>>>> may be caused by how nvvm reflection works, but I'd like to see a
>>>>> concrete example.
>>>>>
>>>>>> Thanks!
>>>>>> yuanfeng
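For readers following the thread, the device-side flow that eventually worked can be sketched end to end as below. This is an untested outline using the file names from the thread; exact flags vary across LLVM and CUDA versions, and the final fatbinary invocation is an assumption modeled on what clang's CUDA driver typically runs, not something taken from the thread itself.

```shell
# Merge the instrumented kernel bitcode with the hook library, importing
# only the symbols the kernel actually references; this is the -only-needed
# fix that avoids duplicate definitions such as the nvvm reflect anchor.
llvm-link axpy-sm_30.bc cuda_hooks-sm_30.bc -only-needed -o inst_axpy-sm_30.bc

# Lower the merged module to PTX for the target GPU.
llc -mcpu=sm_30 inst_axpy-sm_30.bc -o axpy-sm_30.s

# Assemble the PTX into a GPU object (cubin).
/usr/local/cuda/bin/ptxas -m64 -O3 -c --gpu-name sm_30 \
    --output-file axpy-sm_30.o axpy-sm_30.s

# Wrap cubin + PTX into a fatbin suitable for -fcuda-include-gpubinary.
# (Assumed invocation, mirroring clang's usual fatbinary step.)
fatbinary --cuda -64 --create axpy-sm_30.fatbin \
    --image=profile=sm_30,file=axpy-sm_30.o \
    --image=profile=compute_30,file=axpy-sm_30.s
```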
Jingyue Wu via llvm-dev
2016-Mar-15 17:09 UTC
[llvm-dev] instrumenting device code with gpucc
Including the fatbin into the host code should be done in the frontend.

On Mon, Mar 14, 2016 at 12:13 AM, Yuanfeng Peng <yuanfeng.jack.peng at gmail.com> wrote:

> Hey Jingyue,
>
> Thanks for being so responsive! I finally figured out a way to resolve the
> issue: all I have to do is to use `-only-needed` when merging the device
> bitcodes with llvm-link.
>
> However, since we actually need to instrument the host code as well, I
> encountered another issue when I tried to glue the instrumented host code
> and the fatbin together. Now that I have instrumented the IR of the host
> code (axpy.bc) and run `llc axpy.bc -o axpy.s`, which command should I use
> to link axpy.s with axpy-sm_30.fatbin? I tried -cc1as, but the flag
> '-fcuda-include-gpubinary' was not recognized.
>
> Thanks!
> yuanfeng
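One way to act on "should be done in the frontend" is to have clang -cc1 embed the fatbin while emitting host-side LLVM IR, so the kernel-registration code and the fatbin payload are already in the module before instrumentation; plain llc and a normal host link then suffice. The sketch below is untested, and the instrumentation pass, plugin name, and output names are hypothetical placeholders; the `...` stands for the full set of -cc1 flags shown earlier in the thread, with -emit-llvm-bc substituted for -emit-obj.

```shell
# Untested sketch: emit host IR with the fatbin already embedded.
clang-3.9 -cc1 -triple x86_64-unknown-linux-gnu -aux-triple nvptx64-nvidia-cuda \
    ... -emit-llvm-bc -o axpy-host.bc -x cuda tests/axpy.cu \
    -fcuda-include-gpubinary axpy-sm_30.fatbin

# Instrument the host IR (pass and plugin names are hypothetical).
opt -load ./libHostInstr.so -host-instr axpy-host.bc -o axpy-host-inst.bc

# Lower to a host object and link against the CUDA runtime as usual.
llc -filetype=obj axpy-host-inst.bc -o axpy-host.o
clang++ axpy-host.o -lcudart -o axpy
```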
Yuanfeng Peng via llvm-dev
2016-Mar-15 17:45 UTC
[llvm-dev] instrumenting device code with gpucc
Hi Jingyue,

Sorry to ask again, but how exactly can I glue the fatbin together with the instrumented host code? Or does this mean we actually cannot instrument both the host and device code at the same time?

Thanks!
yuanfeng

On Tue, Mar 15, 2016 at 10:09 AM, Jingyue Wu <jingyue at google.com> wrote:

> Including fatbin into host code should be done in frontend.
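For completeness, the two workarounds discussed earlier in the thread for the duplicate '_ZL21__nvvm_reflect_anchorv' definition can be sketched as follows. This is an untested outline; -nvvm-reflect is the legacy pass name from the LLVM 3.x era, and the output file names are made up for illustration.

```shell
# (a) Run the nvvm-reflect pass (plus a normal -O2 cleanup, addressing the
#     point that the unoptimized hook library would run slowly) on each
#     module before merging, so the reflect anchors are resolved first.
opt -nvvm-reflect -O2 cuda_hooks-sm_30.bc -o cuda_hooks-sm_30.opt.bc

# (b) Let llvm-link import only the symbols the kernel references,
#     which is the fix that ultimately worked in this thread.
llvm-link axpy-sm_30.bc cuda_hooks-sm_30.opt.bc -only-needed -o inst_axpy-sm_30.bc
```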