Simeon Ehrig via llvm-dev
2017-Sep-27 17:32 UTC
[llvm-dev] OrcJIT + CUDA Prototype for Cling
Dear LLVM-Developers and Vinod Grover,

we are trying to extend the cling C++ interpreter (https://github.com/root-project/cling) with CUDA functionality for Nvidia GPUs.

I have already developed a prototype based on OrcJIT and am seeking feedback. I am currently stuck on a runtime issue: my interpreter prototype fails to execute kernels, returning a CUDA runtime error.

=== How to use the prototype

The application interprets CUDA runtime code. It needs the whole CUDA program (.cu file) and its pre-compiled device code (as fatbin) as input:

command: cuda-interpreter [source].cu [kernels].fatbin

I also implemented an alternative mode that generates an object file instead. The object file can be linked (with ld) to an executable. This mode exists only to check that the LLVM module generation works as expected. Activate it by changing the define INTERPRET from 1 to 0.

=== Implementation

The prototype is based on the clang interpreter example:
https://github.com/llvm-mirror/clang/tree/master/examples/clang-interpreter

I also pushed the source code to GitHub, with install instructions and examples:
https://github.com/SimeonEhrig/CUDA-Runtime-Interpreter

The device code can be compiled to PTX with either clang's CUDA frontend or NVCC.

Here is the workflow in five stages:

1. generate PTX device code (a kind of Nvidia assembly)
2. translate PTX to SASS (the machine code corresponding to the PTX)
3. generate a fatbinary (a kind of wrapper around the device code)
4. generate the host code object file (using the fatbinary as input)
5. link to an executable

(The exact commands are stored in commands.txt in the GitHub repo.)

The interpreter replaces the 4th and 5th steps: it interprets the host code together with the pre-compiled device code (the fatbinary). The fatbinary (steps 1 to 3) is generated with the clang compiler and the Nvidia tools ptxas and fatbinary.

=== Test Cases and Issues

You can find the test sources on GitHub in the directory "example_prog".
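As an illustration of the five-stage workflow above, the commands might look roughly like this (a hedged sketch: the file names and the sm_30 architecture are placeholders, flag syntax varies between CUDA and clang versions, and the authoritative commands are in commands.txt in the repo):

```shell
# 1. CUDA source -> PTX via clang's CUDA frontend (sm_30 is a placeholder arch)
clang++ --cuda-device-only --cuda-gpu-arch=sm_30 -S runtime.cu -o runtime.ptx
# 2. PTX -> SASS machine code
ptxas -arch=sm_30 runtime.ptx -o runtime.sass
# 3. wrap the device code into a fatbinary (flag syntax differs per CUDA version)
fatbinary --cuda -64 --create runtime.fatbin --image=profile=sm_30,file=runtime.sass
# 4. compile the host code to an object file, embedding the fatbinary
clang++ -c runtime.cu -o runtime.o   # plus the flag that embeds runtime.fatbin
# 5. link to an executable
ld runtime.o -o runtime              # plus CRT startup objects and -lcudart
```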
Run the tests with cuda-interpreter and the two arguments as above:

[1] path to the source code in "example_prog"
    - note: even for host-only code, use the file ending .cu

[2] path to the runtime .fatbin
    - note: it needs the file ending .fatbin
    - a fatbin file is always required, but if the program does not launch a kernel, the content of the file is ignored

Note: As a prototype, the input handling is static and barely validated yet.

1. hello.cu: a simple C++ hello-world program with the cmath library call sqrt() -> works without problems

2. pthread_test.cu: a C++ program that starts a second thread -> works without problems

3. fat_memory.cu: uses the CUDA library and allocates about 191 MB of VRAM. After the allocation, the program waits for 3 seconds, so you can check the memory usage with nvidia-smi -> works without problems

4. runtime.cu: combines the CUDA library with a simple CUDA kernel -> generates an object file, which can be linked (see the 5th call in the commands above -> ld ...) to a working executable.

The last example has the following issue: running the linked executable works fine, but interpreting the code does not. The CUDA runtime returns error 8 (cudaErrorInvalidDeviceFunction); the kernel launch failed.

Do you have any idea how to proceed?

Best regards,
Simeon Ehrig
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170927/834cd020/attachment.html>
Lang Hames via llvm-dev
2017-Nov-08 23:59 UTC
[llvm-dev] OrcJIT + CUDA Prototype for Cling
Hi Simeon,

I think the best thing would be to add an ObjectTransformLayer between your CompileLayer and LinkingLayer so that you can capture the object files as they're generated. Then you can inspect the object files being generated by the compiler to see what might be wrong with them. Something like this:

class KaleidoscopeJIT {
private:
  using ObjectPtr = std::shared_ptr<object::OwningBinary<object::ObjectFile>>;

  static ObjectPtr dumpObject(ObjectPtr Obj) {
    SmallVector<char, 256> UniqueObjFileName;
    sys::fs::createUniqueFile("jit-object-%%%.o", UniqueObjFileName);
    std::error_code EC;
    raw_fd_ostream ObjFileStream(
        StringRef(UniqueObjFileName.data(), UniqueObjFileName.size()), EC,
        sys::fs::F_RW);
    ObjFileStream.write(Obj->getBinary()->getData().data(),
                        Obj->getBinary()->getData().size());
    return Obj;
  }

  std::unique_ptr<TargetMachine> TM;
  const DataLayout DL;
  RTDyldObjectLinkingLayer ObjectLayer;
  ObjectTransformLayer<decltype(ObjectLayer),
                       decltype(&KaleidoscopeJIT::dumpObject)>
      DumpObjectsLayer;
  IRCompileLayer<decltype(DumpObjectsLayer), SimpleCompiler> CompileLayer;

public:
  using ModuleHandle = decltype(CompileLayer)::ModuleHandleT;

  KaleidoscopeJIT()
      : TM(EngineBuilder().selectTarget()), DL(TM->createDataLayout()),
        ObjectLayer([]() { return std::make_shared<SectionMemoryManager>(); }),
        DumpObjectsLayer(ObjectLayer, &KaleidoscopeJIT::dumpObject),
        CompileLayer(DumpObjectsLayer, SimpleCompiler(*TM)) {
    llvm::sys::DynamicLibrary::LoadLibraryPermanently(nullptr);
  }

  // ... (remaining members as in the Kaleidoscope JIT tutorial)
};

Hope this helps!

Cheers,
Lang.

On Wed, Sep 27, 2017 at 10:32 AM, Simeon Ehrig via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [...]
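Once the dumpObject transform described above has written the jit-object-*.o files, the captured objects can be inspected with the standard LLVM binary tools, for example (the dumped file name here is hypothetical):

```shell
# disassemble the captured JIT object and print its relocations
llvm-objdump -d -r jit-object-abc.o
# list its symbols, e.g. to check for the kernel stub and registration functions
llvm-nm jit-object-abc.o
```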
Simeon Ehrig via llvm-dev
2017-Nov-14 21:15 UTC
[llvm-dev] OrcJIT + CUDA Prototype for Cling
Hi Lang,

thank you very much. I used your code, and creating the object file works. I think the problem occurs after the object file is created: when I link the object file with ld, I get an executable that works correctly.

After switching the clang and llvm libraries from the packaged version (.deb) to my own build with debug options enabled, I get an assertion failure. In RuntimeDyldELF::resolveX86_64Relocation(), the case ELF::R_X86_64_PC32 triggers the assert. You can find the code in the file llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp. I don't know exactly what this function does, but after a first look I think it has something to do with the linking. Maybe you know more about the function?

Your code also helped me understand better how the interpreter library works. I also have some new ideas for how to find and solve the concrete problem.

Cheers,
Simeon

On 09.11.2017 at 00:59, Lang Hames wrote:
> [...]
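For context on the assertion Simeon mentions, here is a simplified C++ sketch of the check performed when resolving an ELF::R_X86_64_PC32 relocation (an illustrative paraphrase, not the actual LLVM source): the resolver computes the PC-relative offset from the relocation site to the target symbol and asserts that it fits in a signed 32-bit field. Code whose call targets end up unresolved (address 0) or loaded more than ±2 GB away will trip it:

```cpp
#include <cstdint>
#include <limits>

// Simplified sketch of the R_X86_64_PC32 range check in
// RuntimeDyldELF::resolveX86_64Relocation (illustrative, not the real code).
// Value:        resolved address of the target symbol
// Addend:       relocation addend from the ELF relocation entry
// FinalAddress: address of the relocation site in the loaded section
bool pc32OffsetFits(uint64_t Value, int64_t Addend, uint64_t FinalAddress) {
  // The PC-relative delta is computed in 64 bits...
  int64_t RealOffset = static_cast<int64_t>(Value + Addend - FinalAddress);
  // ...and the real code asserts it fits into a signed 32-bit immediate.
  return RealOffset <= std::numeric_limits<int32_t>::max() &&
         RealOffset >= std::numeric_limits<int32_t>::min();
}
```

If the dumped object's relocations target symbols that were never resolved, or sections that the JIT loaded far apart, this is a typical place for the assertion to fire.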