thr3ads.net - llvm dev - [llvm-dev] NVPTX Back-end: relocatable device code support for dynamic parallelism [Jun 2017]

If this information is useful, please help other people find it:
Share via:

Lorenz Braun via llvm-dev

2017-Jun-09 11:31 UTC

[llvm-dev] NVPTX Back-end: relocatable device code support for dynamic parallelism

Hi everyone,

CUDA allows to call some runtime functions also from the device code. On 
a multi-GPU system this allows the GPU to determine its device id on its 
own via cudaGetDevice().
Unfortunately i cannot get it working when compiling with clang. When 
compiling with nvcc relocatable device code needs to be set to true 
(-rdc=true) and the cudadevrt is needed when linking [0]. I did not 
found such switches to turn rdc for clang. Just compiling does not work 
as ptxas does not find the function cudaGetDevice().

My guess is, that this feature is not supported. Does anyone know is 
this is the case?

I also tried to find out what nvcc is doing when setting rdc to on, but 
hat a few problem trying to understand whats going on. I will attach the 
verbose output of nvcc. I
have no clue what the binaries cudafe/cudafe++ and cicc are doing so its 
rather hard to guess whats happening.
There are additional options like -D__CUDACC_RDC__, --device-c and 
--compile-only that are not used when rdc is off. All but --device-c can 
be used with clang and i can compile my program, however i can't get it 
to run properly.  For each runtime call i get an unknown error with code 30.

I have few hope, that someone already has figured out how to use get rdc 
to work with clang, but i will be grateful for any hint. To whom could i 
write to regarding this problem? Maybe the NVPTX developers can help?

[0] 
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#toolkit-support-for-dynamic-parallelism

-- 
Lorenz Braun
Research Associate
Institute of Computer Engineering (ZITI)
B6, 26, Building B, Office B2.20
68131 Mannheim

Phone: +49-621-181-2696
lorenz.braun at ziti.uni-heidelberg.de

-------------- next part --------------
#$ _SPACE_#$ _CUDART_=cudart
#$ _HERE_=/opt/cuda-8.0/bin
#$ _THERE_=/opt/cuda-8.0/bin
#$ _TARGET_SIZE_#$ _TARGET_DIR_#$ _TARGET_SIZE_=64
#$ TOP=/opt/cuda-8.0/bin/..
#$ NVVMIR_LIBRARY_DIR=/opt/cuda-8.0/bin/../nvvm/libdevice
#$
LD_LIBRARY_PATH=/opt/cuda-8.0/bin/../lib:/opt/cuda-8.0/lib64:/opt/llvm-4.0/lib
#$
PATH=/opt/cuda-8.0/bin/../open64/bin:/opt/cuda-8.0/bin/../nvvm/bin:/opt/cuda-8.0/bin:/opt/cuda-8.0/bin:/opt/llvm-4.0/bin:/home/lbraun/.opt/local/bin:/home/lbraun/.local/bin:/home/lbraun/.local/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/lbraun/bin:/home/lbraun/bin
#$ INCLUDES="-I/opt/cuda-8.0/bin/..//include"
#$ LIBRARIES=  "-L/opt/cuda-8.0/bin/..//lib64/stubs"
"-L/opt/cuda-8.0/bin/..//lib64"
#$ CUDAFE_FLAGS#$ PTXAS_FLAGS#$ gcc -std=c++11 -D__CUDA_ARCH__=350 -E -x c++    
-DCUDA_DOUBLE_MATH_FUNCTIONS  -D__CUDACC__ -D__NVCC__ -D__CUDACC_RDC__ 
"-I/opt/cuda-8.0/bin/..//include"   -D"__CUDACC_VER__=80061"
-D"__CUDACC_VER_BUILD__=61" -D"__CUDACC_VER_MINOR__=0"
-D"__CUDACC_VER_MAJOR__=8" -include "cuda_runtime.h" -m64
"../testApps/cuda_id_test.cu" >
"/tmp/tmpxft_00007040_00000000-9_cuda_id_test.cpp1.ii"

#$ cudafe --allow_managed --m64 --gnu_version=40805 --c++11 -tused
--no_remove_unneeded_entities  --device-c --gen_c_file_name
"/tmp/tmpxft_00007040_00000000-4_cuda_id_test.cudafe1.c"
--stub_file_name
"/tmp/tmpxft_00007040_00000000-4_cuda_id_test.cudafe1.stub.c"
--gen_device_file_name
"/tmp/tmpxft_00007040_00000000-4_cuda_id_test.cudafe1.gpu" --nv_arch
"compute_35" --gen_module_id_file --module_id_file_name
"/tmp/tmpxft_00007040_00000000-3_cuda_id_test.module_id"
--include_file_name "tmpxft_00007040_00000000-2_cuda_id_test.fatbin.c"
"/tmp/tmpxft_00007040_00000000-9_cuda_id_test.cpp1.ii"

#$ gcc -std=c++11 -E -x c++ -D__CUDACC__ -D__NVCC__ -D__CUDACC_RDC__ 
"-I/opt/cuda-8.0/bin/..//include"   -D"__CUDACC_VER__=80061"
-D"__CUDACC_VER_BUILD__=61" -D"__CUDACC_VER_MINOR__=0"
-D"__CUDACC_VER_MAJOR__=8" -include "cuda_runtime.h" -m64
"../testApps/cuda_id_test.cu" >
"/tmp/tmpxft_00007040_00000000-5_cuda_id_test.cpp4.ii"

#$ cudafe++ --allow_managed --m64 --gnu_version=40805 --c++11 --parse_templates 
--device-c --gen_c_file_name
"/tmp/tmpxft_00007040_00000000-4_cuda_id_test.cudafe1.cpp"
--stub_file_name
"tmpxft_00007040_00000000-4_cuda_id_test.cudafe1.stub.c"
--module_id_file_name
"/tmp/tmpxft_00007040_00000000-3_cuda_id_test.module_id"
"/tmp/tmpxft_00007040_00000000-5_cuda_id_test.cpp4.ii"

#$ gcc -D__CUDA_ARCH__=350 -E -x c        -DCUDA_DOUBLE_MATH_FUNCTIONS 
-D__CUDACC__ -D__NVCC__ -D__CUDACC_RDC__ -D__CUDANVVM__  -D__CUDA_FTZ=0
-D__CUDA_PREC_DIV=1 -D__CUDA_PREC_SQRT=1
"-I/opt/cuda-8.0/bin/..//include"   -m64
"/tmp/tmpxft_00007040_00000000-4_cuda_id_test.cudafe1.gpu" >
"/tmp/tmpxft_00007040_00000000-11_cuda_id_test.cpp2.i"

#$ cudafe -w --allow_managed --m64 --gnu_version=40805 --c  --device-c
--gen_c_file_name
"/tmp/tmpxft_00007040_00000000-12_cuda_id_test.cudafe2.c"
--stub_file_name
"/tmp/tmpxft_00007040_00000000-12_cuda_id_test.cudafe2.stub.c"
--gen_device_file_name
"/tmp/tmpxft_00007040_00000000-12_cuda_id_test.cudafe2.gpu" --nv_arch
"compute_35" --module_id_file_name
"/tmp/tmpxft_00007040_00000000-3_cuda_id_test.module_id"
--include_file_name "tmpxft_00007040_00000000-2_cuda_id_test.fatbin.c"
"/tmp/tmpxft_00007040_00000000-11_cuda_id_test.cpp2.i"

#$ gcc -D__CUDA_ARCH__=350 -E -x c        -DCUDA_DOUBLE_MATH_FUNCTIONS 
-D__CUDABE__ -D__CUDANVVM__ -D__USE_FAST_MATH__=0  -D__CUDA_FTZ=0
-D__CUDA_PREC_DIV=1 -D__CUDA_PREC_SQRT=1
"-I/opt/cuda-8.0/bin/..//include"   -m64
"/tmp/tmpxft_00007040_00000000-12_cuda_id_test.cudafe2.gpu" >
"/tmp/tmpxft_00007040_00000000-13_cuda_id_test.cpp3.i"

#$ cicc  -arch compute_35 -m64 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1
-nvvmir-library
"/opt/cuda-8.0/bin/../nvvm/libdevice/libdevice.compute_35.10.bc" 
--device-c --orig_src_file_name "../testApps/cuda_id_test.cu" 
"/tmp/tmpxft_00007040_00000000-13_cuda_id_test.cpp3.i" -o
"/tmp/tmpxft_00007040_00000000-6_cuda_id_test.ptx"

#$ ptxas  -arch=sm_35 -m64 --compile-only 
"/tmp/tmpxft_00007040_00000000-6_cuda_id_test.ptx"  -o
"/tmp/tmpxft_00007040_00000000-14_cuda_id_test.sm_35.cubin"

#$ fatbinary
--create="/tmp/tmpxft_00007040_00000000-2_cuda_id_test.fatbin" -64
--cmdline="--compile-only  "
"--image=profile=sm_35,file=/tmp/tmpxft_00007040_00000000-14_cuda_id_test.sm_35.cubin"
"--image=profile=compute_35,file=/tmp/tmpxft_00007040_00000000-6_cuda_id_test.ptx"
--embedded-fatbin="/tmp/tmpxft_00007040_00000000-2_cuda_id_test.fatbin.c"
--cuda --device-c
#$ rm /tmp/tmpxft_00007040_00000000-2_cuda_id_test.fatbin

#$ gcc -std=c++11 -D__CUDA_ARCH__=350 -E -x c++       
-DCUDA_DOUBLE_MATH_FUNCTIONS  -D__USE_FAST_MATH__=0  -D__CUDA_FTZ=0
-D__CUDA_PREC_DIV=1 -D__CUDA_PREC_SQRT=1
"-I/opt/cuda-8.0/bin/..//include"   -m64
"/tmp/tmpxft_00007040_00000000-4_cuda_id_test.cudafe1.cpp" >
"/tmp/tmpxft_00007040_00000000-15_cuda_id_test.ii"
#$ gcc -std=c++11 -c -x c++ "-I/opt/cuda-8.0/bin/..//include"  
-fpreprocessed -m64 -o
"/tmp/tmpxft_00007040_00000000-16_cuda_id_test.o"
"/tmp/tmpxft_00007040_00000000-15_cuda_id_test.ii"
#$ nvlink --arch=sm_35
--register-link-binaries="/tmp/tmpxft_00007040_00000000-7_id_test_dlink.reg.c"
-m64   "-L/opt/cuda-8.0/bin/..//lib64/stubs"
"-L/opt/cuda-8.0/bin/..//lib64" -cpu-arch=X86_64
"/tmp/tmpxft_00007040_00000000-16_cuda_id_test.o"  -lcudadevrt  -o
"/tmp/tmpxft_00007040_00000000-17_id_test_dlink.sm_35.cubin"
#$ fatbinary
--create="/tmp/tmpxft_00007040_00000000-8_id_test_dlink.fatbin" -64
--cmdline="--compile-only  " -link
"--image=profile=sm_35,file=/tmp/tmpxft_00007040_00000000-17_id_test_dlink.sm_35.cubin"
--embedded-fatbin="/tmp/tmpxft_00007040_00000000-8_id_test_dlink.fatbin.c"
#$ rm /tmp/tmpxft_00007040_00000000-8_id_test_dlink.fatbin

#$ gcc -std=c++11 -c -x c++
-DFATBINFILE="\"/tmp/tmpxft_00007040_00000000-8_id_test_dlink.fatbin.c\""
-DREGISTERLINKBINARYFILE="\"/tmp/tmpxft_00007040_00000000-7_id_test_dlink.reg.c\""
-I. "-I/opt/cuda-8.0/bin/..//include"  
-D"__CUDACC_VER__=80061" -D"__CUDACC_VER_BUILD__=61"
-D"__CUDACC_VER_MINOR__=0" -D"__CUDACC_VER_MAJOR__=8" -m64
-o "/tmp/tmpxft_00007040_00000000-18_id_test_dlink.o"
"/opt/cuda-8.0/bin/crt/link.stub"
#$ g++ -m64 -o "id_test" -std=c++11 -Wl,--start-group
"/tmp/tmpxft_00007040_00000000-18_id_test_dlink.o"
"/tmp/tmpxft_00007040_00000000-16_cuda_id_test.o"  
"-L/opt/cuda-8.0/bin/..//lib64/stubs"
"-L/opt/cuda-8.0/bin/..//lib64" -lcudadevrt  -lcudart_static  -lrt
-lpthread  -ldl  -Wl,--end-group

Justin Lebar via llvm-dev

2017-Aug-25 20:10 UTC

head link

[llvm-dev] NVPTX Back-end: relocatable device code support for dynamic parallelism

Sorry for the long delay in replying, I'm not good at reading the mailing
list.
> My guess is, that this feature is not supported. Does anyone know is this
is the case?
Clang does not support dynamic parallelism or relocatable CUDA code
today.  Patches -- and documentation fixes to mention this at
https://llvm.org/docs/CompileCudaWithLLVM.html -- are certainly
welcome.

-Justin

On Fri, Jun 9, 2017 at 4:31 AM, Lorenz Braun via llvm-dev
<llvm-dev at lists.llvm.org> wrote:> Hi everyone,
>
> CUDA allows to call some runtime functions also from the device code. On a
> multi-GPU system this allows the GPU to determine its device id on its own
> via cudaGetDevice().
> Unfortunately i cannot get it working when compiling with clang. When
> compiling with nvcc relocatable device code needs to be set to true
> (-rdc=true) and the cudadevrt is needed when linking [0]. I did not found
> such switches to turn rdc for clang. Just compiling does not work as ptxas
> does not find the function cudaGetDevice().
>
> My guess is, that this feature is not supported. Does anyone know is this
is
> the case?
>
> I also tried to find out what nvcc is doing when setting rdc to on, but hat
> a few problem trying to understand whats going on. I will attach the
verbose
> output of nvcc. I
> have no clue what the binaries cudafe/cudafe++ and cicc are doing so its
> rather hard to guess whats happening.
> There are additional options like -D__CUDACC_RDC__, --device-c and
> --compile-only that are not used when rdc is off. All but --device-c can be
> used with clang and i can compile my program, however i can't get it to
run
> properly.  For each runtime call i get an unknown error with code 30.
>
> I have few hope, that someone already has figured out how to use get rdc to
> work with clang, but i will be grateful for any hint. To whom could i write
> to regarding this problem? Maybe the NVPTX developers can help?
>
> [0]
>
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#toolkit-support-for-dynamic-parallelism
>
> --
> Lorenz Braun
> Research Associate
> Institute of Computer Engineering (ZITI)
> B6, 26, Building B, Office B2.20
> 68131 Mannheim
>
> Phone: +49-621-181-2696
> lorenz.braun at ziti.uni-heidelberg.de
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

Maybe Matching Threads

Search for more reasonably related threads

llvm dev - Jun 2017 - NVPTX Back-end: relocatable device code support for dynamic parallelism

[llvm-dev] NVPTX Back-end: relocatable device code support for dynamic parallelism

[llvm-dev] NVPTX Back-end: relocatable device code support for dynamic parallelism

Maybe Matching Threads