Tobias Grosser
2012-Apr-04 11:49 UTC
[LLVMdev] GSoC 2012 Proposal: Automatic GPGPU code generation for llvm
On 04/03/2012 03:13 PM, Hongbin Zheng wrote:
> Hi Yabin,
>
> Instead of compiling the LLVM IR to a PTX asm string in a ScopPass, you
> can also improve llc/lli or create new tools to support code generation
> for heterogeneous platforms[1], i.e. generate code for more than one
> target architecture at the same time. Something like this is not very
> complicated and has been implemented[2,3] by some people, but is not
> available in LLVM mainstream. Implementing this could make your GPU
> project more complete.

I agree with ether that we should ensure as much work as possible is done
within generic, not Polly-specific, code.

In terms of heterogeneous code generation, the approach Yabin proposed
seems to work, but we should discuss other approaches. For the moment, I
believe his proposal is very similar to the model of OpenCL and CUDA: he
splits the code into host and kernel code. The host code is directly
compiled to machine code by the existing tools (clang/llc). The kernel
code is stored as a string and only compiled to platform-specific code at
execution time.

Are there any other approaches that could be taken? What specific
heterogeneous platform support would be needed? At the moment, it seems to
me we actually do not need much additional support.

Cheers,
Tobi
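As a concrete illustration of the host/kernel model discussed above, here is a
minimal host-side sketch using the CUDA Driver API. The embedded PTX string and
the kernel name "polly_kernel" are placeholders for illustration, not anything
the proposal specifies:

    // Host-side loading of an embedded PTX string via the CUDA Driver API.
    #include <cuda.h>
    #include <cstdio>

    // Stand-in for the PTX the compiler would embed in the host binary.
    static const char *KernelPTX = "...";

    int main() {
      cuInit(0);
      CUdevice Dev;
      cuDeviceGet(&Dev, 0);
      CUcontext Ctx;
      cuCtxCreate(&Ctx, 0, Dev);

      // The driver JIT-compiles the PTX for whatever GPU is present.
      CUmodule Mod;
      if (cuModuleLoadData(&Mod, KernelPTX) != CUDA_SUCCESS) {
        std::fprintf(stderr, "PTX JIT compilation failed\n");
        return 1;
      }

      CUfunction Kernel;
      cuModuleGetFunction(&Kernel, Mod, "polly_kernel"); // hypothetical name

      // Launch one block of 32 threads; real code passes kernel arguments
      // through the last pointer parameters.
      cuLaunchKernel(Kernel, 1, 1, 1, 32, 1, 1, 0, nullptr, nullptr, nullptr);
      cuCtxSynchronize();
      cuCtxDestroy(Ctx);
      return 0;
    }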
Justin Holewinski
2012-Apr-04 14:17 UTC
[LLVMdev] GSoC 2012 Proposal: Automatic GPGPU code generation for llvm
On Wed, Apr 4, 2012 at 4:49 AM, Tobias Grosser <tobias at grosser.es> wrote:
> On 04/03/2012 03:13 PM, Hongbin Zheng wrote:
>> Hi Yabin,
>>
>> Instead of compiling the LLVM IR to a PTX asm string in a ScopPass, you
>> can also improve llc/lli or create new tools to support code generation
>> for heterogeneous platforms[1], i.e. generate code for more than one
>> target architecture at the same time. Something like this is not very
>> complicated and has been implemented[2,3] by some people, but is not
>> available in LLVM mainstream. Implementing this could make your GPU
>> project more complete.
>
> I agree with ether that we should ensure as much work as possible is
> done within generic, not Polly-specific, code.

Right, this has the potential to impact more people than the users of
Polly. By moving as much as possible to generic LLVM, that infrastructure
can be leveraged by people doing work outside of the polyhedral model.

> In terms of heterogeneous code generation, the approach Yabin proposed
> seems to work, but we should discuss other approaches. For the moment,
> I believe his proposal is very similar to the model of OpenCL and CUDA:
> he splits the code into host and kernel code. The host code is directly
> compiled to machine code by the existing tools (clang/llc). The kernel
> code is stored as a string and only compiled to platform-specific code
> at execution time.

Depending on your target, that may be the only way. If your target is
OpenCL-compatible accelerators, then your only portable option is to save
the kernel code as OpenCL text and let the driver JIT compile it at
run-time. Any other approach is not guaranteed to be compatible across
platforms or even driver versions.

In this case, the target is the CUDA Driver API, so you're free to pass
along any valid PTX assembly. You still pass the PTX code as a string to
the driver, which JIT compiles it to actual GPU device code at run-time.

> Are there any other approaches that could be taken? What specific
> heterogeneous platform support would be needed? At the moment, it seems
> to me we actually do not need much additional support.

I could see this working without any additional support, if needed. It
seems like this proposal is dealing with LLVM IR -> LLVM IR code
generation, so the only thing that is really needed is a way to split the
IR into multiple separate IRs (one for host, and one for each accelerator
target). This does not really need any supporting infrastructure, as you
could imagine an opt pass processing the input IR, transforming it to the
host IR, and emitting the device IR as a separate module.

Now if you're talking about source-level support for heterogeneous
platforms (e.g. C++ AMP), then you would need to adapt Clang to support
emission of multiple IR modules. Basically, the AST would need to be split
into host and device portions and codegen'd appropriately. I feel that is
far beyond the scope of this proposal, though.

--
Thanks,
Justin Holewinski
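Justin's opt-pass idea can be made concrete with a short sketch. This is
written against present-day LLVM C++ APIs rather than the 2012 ones of the
thread, and the "polly.kernel" function attribute it checks is an assumed
marker, not an existing LLVM convention:

    #include "llvm/IR/Function.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Transforms/Utils/Cloning.h"
    #include <memory>

    using namespace llvm;

    // Clone the module, keep only the marked kernels in the device copy,
    // and strip their bodies from the host copy.
    std::unique_ptr<Module> extractDeviceModule(Module &Host) {
      std::unique_ptr<Module> Device = CloneModule(Host);

      for (Function &F : *Device)
        if (!F.isDeclaration() && !F.hasFnAttribute("polly.kernel"))
          F.deleteBody(); // keep declarations so references still resolve

      for (Function &F : Host)
        if (F.hasFnAttribute("polly.kernel"))
          F.deleteBody(); // the host will call into the runtime instead

      return Device;
    }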
Tobias Grosser
2012-Apr-04 14:35 UTC
[LLVMdev] GSoC 2012 Proposal: Automatic GPGPU code generation for llvm
On 04/04/2012 04:17 PM, Justin Holewinski wrote:
> On Wed, Apr 4, 2012 at 4:49 AM, Tobias Grosser <tobias at grosser.es> wrote:
>> On 04/03/2012 03:13 PM, Hongbin Zheng wrote:
>>> [...]
>>
>> I agree with ether that we should ensure as much work as possible is
>> done within generic, not Polly-specific, code.
>
> Right, this has the potential to impact more people than the users of
> Polly. By moving as much as possible to generic LLVM, that
> infrastructure can be leveraged by people doing work outside of the
> polyhedral model.

To make stuff generic it is often helpful to know the other possible use
cases. I consequently encourage everybody to point out such use cases or
to state which exact functionality they might want to reuse. Otherwise, it
may happen that we focus a little too much on the needs of Polly.

>> In terms of heterogeneous code generation, the approach Yabin proposed
>> seems to work, but we should discuss other approaches. For the moment,
>> I believe his proposal is very similar to the model of OpenCL and CUDA:
>> he splits the code into host and kernel code. The host code is directly
>> compiled to machine code by the existing tools (clang/llc). The kernel
>> code is stored as a string and only compiled to platform-specific code
>> at execution time.
>
> Depending on your target, that may be the only way. If your target is
> OpenCL-compatible accelerators, then your only portable option is to save
> the kernel code as OpenCL text and let the driver JIT compile it at
> run-time. Any other approach is not guaranteed to be compatible across
> platforms or even driver versions.
> In this case, the target is the CUDA Driver API, so you're free to pass
> along any valid PTX assembly. You still pass the PTX code as a string to
> the driver, which JIT compiles it to actual GPU device code at run-time.

I would like to highlight that with the word 'string' I was not referring
to 'OpenCL C code'. I don't think it is a practical approach to recover
OpenCL C code, especially as the LLVM-IR C backend was recently removed. I
meant to describe that the kernel code is stored as a global variable in
the host binary (in some intermediate representation such as LLVM-IR, PTX
or a vendor-specific OpenCL binary) and is loaded at execution time into
the OpenCL or CUDA runtime, where it is compiled down to hardware-specific
machine code.

>> Are there any other approaches that could be taken? What specific
>> heterogeneous platform support would be needed? At the moment, it seems
>> to me we actually do not need much additional support.
>
> I could see this working without any additional support, if needed. It
> seems like this proposal is dealing with LLVM IR -> LLVM IR code
> generation, so the only thing that is really needed is a way to split
> the IR into multiple separate IRs (one for host, and one for each
> accelerator target).
> This does not really need any supporting infrastructure, as you could
> imagine an opt pass processing the input IR, transforming it to the host
> IR, and emitting the device IR as a separate module.

Yes. And instead of saving the two modules in separate files, we can store
the kernel module as a 'string' in the host module and add the necessary
library calls to load it at run time. This will give a smooth user
experience and requires almost no additional infrastructure. (At the
moment this will only work with NVIDIA, but I am confident there will be
OpenCL vendor extensions that allow loading LLVM-IR kernels. AMD's OpenCL
can, e.g., load LLVM-IR, even though it is not officially supported.)

> Now if you're talking about source-level support for heterogeneous
> platforms (e.g. C++ AMP), then you would need to adapt Clang to support
> emission of multiple IR modules. Basically, the AST would need to be
> split into host and device portions and codegen'd appropriately. I feel
> that is far beyond the scope of this proposal, though.

Yes. No source-level transformations, and no targets other than PTX, AMDIL
or LLVM-IR.

Cheers,
Tobi
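A minimal sketch of the embedding Tobias describes, again written against
current LLVM C++ APIs; the global name "__polly_kernel_str" and the runtime
hook "polly_initKernels" are invented for illustration, not an existing
runtime interface:

    #include "llvm/IR/Constants.h"
    #include "llvm/IR/DerivedTypes.h"
    #include "llvm/IR/GlobalVariable.h"
    #include "llvm/IR/Module.h"

    using namespace llvm;

    void embedKernel(Module &Host, StringRef KernelText) {
      LLVMContext &Ctx = Host.getContext();

      // Store the kernel (PTX, LLVM-IR bitcode, ...) as a private
      // constant string in the host module.
      Constant *Str = ConstantDataArray::getString(Ctx, KernelText);
      new GlobalVariable(Host, Str->getType(), /*isConstant=*/true,
                         GlobalValue::PrivateLinkage, Str,
                         "__polly_kernel_str");

      // Declare the hook that would hand the string to the CUDA or
      // OpenCL runtime; a real pass would also insert a call to it,
      // e.g. from a global constructor.
      Host.getOrInsertFunction("polly_initKernels",
                               Type::getVoidTy(Ctx),
                               PointerType::get(Type::getInt8Ty(Ctx), 0));
    }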
Hongbin Zheng
2012-Apr-04 16:48 UTC
[LLVMdev] Fwd: GSoC 2012 Proposal: Automatic GPGPU code generation for llvm
oops, forgot to cc the dev-list

hi tobi,

> Yes. And instead of saving the two modules in separate files, we can
> store the kernel module as a 'string' in the host module and add the
> necessary library calls to load it at run time. This will give a smooth
> user experience and requires almost no additional infrastructure.

We may lose some co-optimization opportunities if we translate the device
functions to a string too early. Instead, we can mark the device functions
with a special calling convention and translate them in lli/llc.

best regards
ether
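A sketch of what ether's marking could look like. LLVM's PTX_Kernel and
PTX_Device calling conventions do exist and survive in the IR; the policy of
using them to drive the split in llc/lli is the assumption here:

    #include "llvm/IR/CallingConv.h"
    #include "llvm/IR/Function.h"

    using namespace llvm;

    // Tag a function as device code so a later llc/lli stage can pick
    // it out instead of stringifying the kernel early.
    void markAsDeviceKernel(Function &F) {
      F.setCallingConv(CallingConv::PTX_Kernel);
    }

    bool isDeviceFunction(const Function &F) {
      return F.getCallingConv() == CallingConv::PTX_Kernel ||
             F.getCallingConv() == CallingConv::PTX_Device;
    }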