Jingyue Wu via llvm-dev
2016-Mar-10 21:19 UTC
[llvm-dev] instrumenting device code with gpucc
It's hard to tell what is wrong without a concrete example. E.g., what is the program you are instrumenting? What is the definition of the hook function? How did you link that definition with the binary?

One thing suspicious to me is that you may have linked the definition of _Cool_MemRead_Hook as a host function instead of a device function. AFAIK, PTX assembly cannot be linked. So, if you want that hook function called from your device code, you should merge the IR of the hook function and the IR of your device code into one IR (via linking or direct IR emission) before lowering the IR to PTX.

On Wed, Mar 9, 2016 at 4:31 PM, Yuanfeng Peng <yuanfeng.jack.peng at gmail.com> wrote:

> Hi Jingyue,
>
> Thanks for the instructions! I instrumented the device code and got a
> binary of axpy.cu; however, the resulting executable always fails on the
> first cudaMalloc call in host code (the kernel had not even been launched
> yet), with error code 30 (cudaErrorUnknown). In my instrumentation pass,
> I only inserted a hook function upon each access to device memory, with
> the signature "__device__ void _Cool_MemRead_Hook(uint64_t addr)". I
> compiled these hook functions into a shared object and linked the axpy
> binary with it.
>
> I'm really sorry to bother you again, but I wonder whether any step I did
> was obviously wrong, or whether there's a gpucc-specific step I need to
> take when instrumenting a kernel?
>
> Thanks!
> yuanfeng
>
> On Fri, Mar 4, 2016 at 7:56 PM, Jingyue Wu <jingyue at google.com> wrote:
>
>> On Fri, Mar 4, 2016 at 5:50 PM, Yuanfeng Peng <yuanfeng.jack.peng at gmail.com> wrote:
>>
>>> Hi Jingyue,
>>>
>>> My name is Yuanfeng Peng; I'm a PhD student at UPenn. I'm sorry to
>>> bother you, but I'm having trouble with gpucc in my project, and I
>>> would be really grateful for your help!
>>>
>>> Currently we're trying to instrument CUDA code using LLVM 3.9, and I've
>>> written a pass to insert hook functions for certain function calls and
>>> memory accesses. For example, given a CUDA program, say, axpy.cu, I
>>> first compile it with
>>>
>>> clang++ -emit-llvm -c axpy.cu
>>>
>>> which gives me two bitcode files, axpy.bc and axpy-sm_20.bc. Then I use
>>> opt to load my pass and insert the hook functions into axpy.bc, which
>>> works fine. After inspecting the instrumented axpy.bc, I noticed that
>>> the kernel code was not there; rather, it lived inside axpy-sm_20.bc,
>>> so I also loaded my pass to instrument axpy-sm_20.bc.
>>
>> Expected. axpy.bc contains host code, and axpy-sm_??.bc contains device
>> code. If you only want to instrument the device side, you don't need to
>> modify axpy.bc.
>>
>>> However, after instrumenting axpy-sm_20.bc, I don't know how I could
>>> combine the host bitcode and device bitcode into a single binary...
>>> When I used llc to compile axpy-sm_20.bc into native code, I always got
>>> a bunch of errors; if I only do "llc axpy.bc -o axpy.s" and then link
>>> axpy.s with the necessary libraries, I get a working binary, but only
>>> the host code is instrumented.
>>>
>>> So what should I do to get a binary where the device code is also
>>> instrumented?
>>
>> To link the modified axpy-sm_20.bc into the final binary, you need
>> several extra steps:
>> 1. Compile axpy-sm_20.bc to PTX assembly using llc:
>>    llc axpy-sm_20.bc -o axpy-sm_20.ptx -march=<nvptx or nvptx64>
>> 2. Compile the PTX assembly to SASS using ptxas.
>> 3. Make the SASS a fat binary using NVIDIA's fatbinary tool.
>> 4. Link the fat binary to the host code using ld.
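Concretely, the four steps above might look like the following. This is only a minimal sketch, assuming the sm_35 target used in the dump below; the file names are illustrative, and the pass-plugin and pass names in step 0 are hypothetical placeholders for your own instrumentation pass:

    # 0. Instrument the device bitcode with your pass
    #    (plugin and pass names here are hypothetical placeholders).
    opt -load libMyInstrumentation.so -my-gpu-instrument axpy-sm_35.bc -o axpy-sm_35.inst.bc

    # 1. Lower the instrumented device bitcode to PTX assembly.
    llc -march=nvptx64 -mcpu=sm_35 axpy-sm_35.inst.bc -o axpy-sm_35.ptx

    # 2. Assemble the PTX to SASS with ptxas (flags as in the dump below).
    ptxas -m64 -O3 --gpu-name sm_35 --output-file axpy-sm_35.o axpy-sm_35.ptx

    # 3. Wrap the SASS (and, optionally, the PTX) into a fat binary.
    fatbinary --cuda -64 --create axpy.fatbin \
        --image=profile=sm_35,file=axpy-sm_35.o \
        --image=profile=compute_35,file=axpy-sm_35.ptx

    # 4. Rebuild the host object and link. The exact cc1 and ld invocations
    #    are machine-specific; take them from the "clang -###" dump below
    #    and point "-fcuda-include-gpubinary" at axpy.fatbin.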
>>
>> Clang does steps 2-4 by invoking subcommands. Therefore, you can use
>> "clang -###" to dump all the subcommands, and then find the ones for
>> steps 2-4. For example,
>>
>> $ clang++ -### -O3 axpy.cu -I/usr/local/cuda/samples/common/inc -L/usr/local/cuda/lib64 -lcudart_static -lcuda -ldl -lrt -pthread --cuda-gpu-arch=sm_35
>>
>> clang version 3.9.0 (http://llvm.org/git/clang.git 4ce165e39e7b185e394aa713d9adffd920288988) (http://llvm.org/git/llvm.git 2550ef485b6f9668bb7a4daa7ab276b6501492df)
>> Target: x86_64-unknown-linux-gnu
>> Thread model: posix
>> InstalledDir: /usr/local/google/home/jingyue/Work/llvm/install/bin
>> "/usr/local/google/home/jingyue/Work/llvm/install-git/bin/clang-3.7" "-cc1" "-triple" "nvptx64-nvidia-cuda" "-aux-triple" "x86_64-unknown-linux-gnu" "-fcuda-target-overloads" "-fcuda-disable-target-call-checks" "-S" "-disable-free" "-main-file-name" "axpy.cu" "-mrelocation-model" "static" "-mthread-model" "posix" "-mdisable-fp-elim" "-fmath-errno" "-no-integrated-as" "-fcuda-is-device" "-mlink-cuda-bitcode" "/usr/local/cuda/nvvm/libdevice/libdevice.compute_35.10.bc" "-target-feature" "+ptx42" "-target-cpu" "sm_35" "-dwarf-column-info" "-debugger-tuning=gdb" "-resource-dir" "/usr/local/google/home/jingyue/Work/llvm/install-git/bin/../lib/clang/3.9.0" "-I" "/usr/local/cuda/samples/common/inc" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/x86_64-linux-gnu/c++/4.8" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/x86_64-linux-gnu/c++/4.8" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/backward" "-internal-isystem" "/usr/local/include" "-internal-isystem" "/usr/local/google/home/jingyue/Work/llvm/install-git/bin/../lib/clang/3.9.0/include" "-internal-externc-isystem" "/include" "-internal-externc-isystem" "/usr/include" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/x86_64-linux-gnu/c++/4.8" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/x86_64-linux-gnu/c++/4.8" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/backward" "-internal-isystem" "/usr/local/cuda/include" "-include" "__clang_cuda_runtime_wrapper.h" "-O3" "-fdeprecated-macro" "-fno-dwarf-directory-asm" "-fdebug-compilation-dir" "/usr/local/google/home/jingyue/Work/cuda" "-ferror-limit" "19" "-fmessage-length" "205" "-pthread" "-fobjc-runtime=gcc" "-fcxx-exceptions" "-fexceptions" "-fdiagnostics-show-option" "-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-o" "/tmp/axpy-a88a72.s" "-x" "cuda" "axpy.cu"
>> "/usr/local/cuda/bin/ptxas" "-m64" "-O3" "--gpu-name" "sm_35" "--output-file" "/tmp/axpy-1dbca7.o" "/tmp/axpy-a88a72.s"
>> "/usr/local/cuda/bin/fatbinary" "--cuda" "-64" "--create" "/tmp/axpy-e6057c.fatbin" "--image=profile=sm_35,file=/tmp/axpy-1dbca7.o" "--image=profile=compute_35,file=/tmp/axpy-a88a72.s"
>> "/usr/local/google/home/jingyue/Work/llvm/install-git/bin/clang-3.7" "-cc1" "-triple" "x86_64-unknown-linux-gnu" "-aux-triple" "nvptx64-nvidia-cuda" "-fcuda-target-overloads" "-fcuda-disable-target-call-checks" "-emit-obj" "-disable-free" "-main-file-name" "axpy.cu" "-mrelocation-model" "static" "-mthread-model" "posix" "-fmath-errno" "-masm-verbose" "-mconstructor-aliases" "-munwind-tables" "-fuse-init-array" "-target-cpu" "x86-64" "-momit-leaf-frame-pointer" "-dwarf-column-info" "-debugger-tuning=gdb" "-resource-dir" "/usr/local/google/home/jingyue/Work/llvm/install-git/bin/../lib/clang/3.9.0" "-I" "/usr/local/cuda/samples/common/inc" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/x86_64-linux-gnu/c++/4.8" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/x86_64-linux-gnu/c++/4.8" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/backward" "-internal-isystem" "/usr/local/include" "-internal-isystem" "/usr/local/google/home/jingyue/Work/llvm/install-git/bin/../lib/clang/3.9.0/include" "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" "-internal-externc-isystem" "/usr/include" "-internal-isystem" "/usr/local/cuda/include" "-include" "__clang_cuda_runtime_wrapper.h" "-O3" "-fdeprecated-macro" "-fdebug-compilation-dir" "/usr/local/google/home/jingyue/Work/cuda" "-ferror-limit" "19" "-fmessage-length" "205" "-pthread" "-fobjc-runtime=gcc" "-fcxx-exceptions" "-fexceptions" "-fdiagnostics-show-option" "-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-o" "/tmp/axpy-48f6b5.o" "-x" "cuda" "axpy.cu" "-fcuda-include-gpubinary" "/tmp/axpy-e6057c.fatbin"
>> "/usr/bin/ld" "-z" "relro" "--hash-style=gnu" "--build-id" "--eh-frame-hdr" "-m" "elf_x86_64" "-dynamic-linker" "/lib64/ld-linux-x86-64.so.2" "-o" "a.out" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/crt1.o" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/crti.o" "/usr/lib/gcc/x86_64-linux-gnu/4.8/crtbegin.o" "-L/usr/local/cuda/lib64" "-L/usr/lib/gcc/x86_64-linux-gnu/4.8" "-L/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu" "-L/lib/x86_64-linux-gnu" "-L/lib/../lib64" "-L/usr/lib/x86_64-linux-gnu" "-L/usr/lib/gcc/x86_64-linux-gnu/4.8/../../.." "-L/usr/local/google/home/jingyue/Work/llvm/install-git/bin/../lib" "-L/lib" "-L/usr/lib" "/tmp/axpy-48f6b5.o" "-lcudart_static" "-lcuda" "-ldl" "-lrt" "-lstdc++" "-lm" "-lgcc_s" "-lgcc" "-lpthread" "-lc" "-lgcc_s" "-lgcc" "/usr/lib/gcc/x86_64-linux-gnu/4.8/crtend.o" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/crtn.o"
>>
>>> I apologize for the long email, and I look forward to hearing from you.
>>>
>>> Thanks!
>>> Yuanfeng Peng
Yuanfeng Peng via llvm-dev
2016-Mar-11 01:21 UTC
[llvm-dev] instrumenting device code with gpucc
Hi Jingyue,

Thank you so much for the helpful response! I didn't know that PTX assembly cannot be linked; that's likely the reason for my issue.

So I did the following as you suggested (axpy-sm_30.bc is the instrumented bitcode, and cuda_hooks-sm_30.bc contains the hook functions):

llvm-link axpy-sm_30.bc cuda_hooks-sm_30.bc -o inst_axpy-sm_30.bc
llc inst_axpy-sm_30.bc -o axpy-sm_30.s
"/usr/local/cuda/bin/ptxas" "-m64" "-O3" -c "--gpu-name" "sm_30" "--output-file" axpy-sm_30.o axpy-sm_30.s

However, I got the following error from ptxas:

ptxas axpy-sm_30.s, line 106; error   : Duplicate definition of function '_ZL21__nvvm_reflect_anchorv'
ptxas axpy-sm_30.s, line 106; fatal   : Parsing error near '.2': syntax error
ptxas fatal   : Ptx assembly aborted due to errors

It looks like some CUDA function definitions are present in both bitcode files, which caused the duplicate definition... What am I supposed to do to resolve this issue?

Thanks!
yuanfeng
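Before resolving the duplicate, it can help to look at the PTX that ptxas is rejecting. A small sketch, using the file name and line number from the error above:

    # Show the PTX around the line ptxas complained about (line 106).
    sed -n '100,112p' axpy-sm_30.s

    # Count how many definitions of the reflection anchor reached the PTX.
    grep -n '_ZL21__nvvm_reflect_anchorv' axpy-sm_30.s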
Jingyue Wu via llvm-dev
2016-Mar-11 11:47 UTC
[llvm-dev] instrumenting device code with gpucc
Looks like we are getting closer!

On Thu, Mar 10, 2016 at 5:21 PM, Yuanfeng Peng <yuanfeng.jack.peng at gmail.com> wrote:

> Hi Jingyue,
>
> Thank you so much for the helpful response! I didn't know that PTX
> assembly cannot be linked; that's likely the reason for my issue.
>
> So I did the following as you suggested (axpy-sm_30.bc is the
> instrumented bitcode, and cuda_hooks-sm_30.bc contains the hook
> functions):
>
> llvm-link axpy-sm_30.bc cuda_hooks-sm_30.bc -o inst_axpy-sm_30.bc
> llc inst_axpy-sm_30.bc -o axpy-sm_30.s
> "/usr/local/cuda/bin/ptxas" "-m64" "-O3" -c "--gpu-name" "sm_30" "--output-file" axpy-sm_30.o axpy-sm_30.s
>
> However, I got the following error from ptxas:
>
> ptxas axpy-sm_30.s, line 106; error   : Duplicate definition of function '_ZL21__nvvm_reflect_anchorv'
> ptxas axpy-sm_30.s, line 106; fatal   : Parsing error near '.2': syntax error
> ptxas fatal   : Ptx assembly aborted due to errors
>
> It looks like some CUDA function definitions are present in both bitcode
> files, which caused the duplicate definition... What am I supposed to do
> to resolve this issue?

Can you attach axpy-sm_30.ll and cuda_hooks-sm_30.ll? The duplication may be caused by how nvvm reflection works, but I'd like to see a concrete example.

> Thanks!
>
> yuanfeng
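The textual IR being asked for can be produced from the bitcode with llvm-dis, and grepping it also shows whether both modules define the anchor. A sketch, assuming the file names used earlier in the thread:

    # Disassemble both modules to readable IR.
    llvm-dis axpy-sm_30.bc -o axpy-sm_30.ll
    llvm-dis cuda_hooks-sm_30.bc -o cuda_hooks-sm_30.ll

    # If the anchor is defined in both modules, it will show up twice here.
    grep -n 'define.*__nvvm_reflect_anchor' axpy-sm_30.ll cuda_hooks-sm_30.ll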
Jingyue Wu via llvm-dev
2016-Mar-11 18:10 UTC
[llvm-dev] instrumenting device code with gpucc
According to the examples you sent, I believe the linking issue was caused by nvvm reflection anchors. I haven't played with that, but I guess running nvvm-reflect on an IR removes the nvvm reflect anchors. After that, you can llvm-link the two bc/ll files.

Another potential issue is that your cuda_hooks-sm_30.ll is unoptimized. This could cause the instrumented code to run super slow.

On Fri, Mar 11, 2016 at 9:40 AM, Yuanfeng Peng <yuanfeng.jack.peng at gmail.com> wrote:

> Hey Jingyue,
>
> Attached are the .ll files. Thanks!
>
> yuanfeng
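Put together, the suggested sequence would look roughly like this. A sketch, assuming opt exposes the -nvvm-reflect pass (as used later in this thread); running -O3 on the hook module addresses the optimization concern above:

    # Resolve __nvvm_reflect calls in each module before linking.
    opt -nvvm-reflect axpy-sm_30.bc -o axpy-sm_30.reflect.bc
    opt -nvvm-reflect cuda_hooks-sm_30.bc -o cuda_hooks-sm_30.reflect.bc

    # Optimize the hook module so the instrumentation isn't needlessly slow.
    opt -O3 cuda_hooks-sm_30.reflect.bc -o cuda_hooks-sm_30.opt.bc

    # Now link the two modules into one device module and lower as before.
    llvm-link axpy-sm_30.reflect.bc cuda_hooks-sm_30.opt.bc -o inst_axpy-sm_30.bc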
Yuanfeng Peng via llvm-dev
2016-Mar-12 00:56 UTC
[llvm-dev] instrumenting device code with gpucc
Hey Jingyue,

Though I tried `opt -nvvm-reflect` on both bc files, the nvvm reflect anchor didn't go away; ptxas is still complaining about the duplicate definition of function '_ZL21__nvvm_reflect_anchorv'. Did I misuse the nvvm-reflect pass?

Thanks!
yuanfeng

On Fri, Mar 11, 2016 at 10:10 AM, Jingyue Wu <jingyue at google.com> wrote:

> According to the examples you sent, I believe the linking issue was
> caused by nvvm reflection anchors. I haven't played with that, but I
> guess running nvvm-reflect on an IR removes the nvvm reflect anchors.
> After that, you can llvm-link the two bc/ll files.
>
> Another potential issue is that your cuda_hooks-sm_30.ll is unoptimized.
> This could cause the instrumented code to run super slow.
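One possible follow-up, given that -nvvm-reflect alone left the anchors in place, is to let global dead-code elimination strip them before linking. This is an untested sketch: it assumes the anchors are internal and become dead once the __nvvm_reflect calls are resolved; if something such as llvm.used keeps them alive, -globaldce will not remove them, and a small custom pass that erases or renames the anchor would be needed instead:

    # Run reflection, then global dead-code elimination, on each module.
    opt -nvvm-reflect -globaldce axpy-sm_30.bc -o axpy-sm_30.clean.bc
    opt -nvvm-reflect -globaldce cuda_hooks-sm_30.bc -o cuda_hooks-sm_30.clean.bc

    # Link, then check that at most one anchor definition remains before
    # handing the PTX to ptxas.
    llvm-link axpy-sm_30.clean.bc cuda_hooks-sm_30.clean.bc -o inst_axpy-sm_30.bc
    llvm-dis inst_axpy-sm_30.bc -o - | grep -c 'define.*__nvvm_reflect_anchor'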