Tobias Grosser
2012-Apr-04 11:49 UTC
[LLVMdev] GSoC 2012 Proposal: Automatic GPGPU code generation for llvm
On 04/03/2012 03:13 PM, Hongbin Zheng wrote:
> Hi Yabin,
>
> Instead of compiling the LLVM IR to a PTX asm string in a ScopPass, you
> can also improve llc/lli or create new tools to support code generation
> for heterogeneous platforms[1], i.e. generate code for more than one
> target architecture at the same time. Something like this is not very
> complicated and has been implemented[2,3] by some people, but is not
> available in LLVM mainstream. Implementing this could make your GPU
> project more complete.

I agree with ether that we should ensure as much work as possible is done
within generic, not Polly-specific, code.

In terms of heterogeneous code generation, the approach Yabin proposed
seems to work, but we should discuss other approaches. For the moment, I
believe his proposal is very similar to the model of OpenCL and CUDA: he
splits the code into host and kernel code. The host code is directly
compiled to machine code by the existing tools (clang/llc). The kernel
code is stored as a string and only compiled to platform-specific code at
execution time.

Are there any other approaches that could be taken? What specific
heterogeneous platform support would be needed? At the moment, it seems to
me we actually do not need much additional support.

Cheers,
Tobi
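As a concrete illustration of the host/kernel model discussed above, here is a
minimal host-side sketch using the CUDA Driver API. The embedded PTX string and
the kernel name "polly_kernel" are placeholders for illustration, not anything
the proposal specifies:

    // Host-side loading of an embedded PTX string via the CUDA Driver API.
    #include <cuda.h>
    #include <cstdio>

    // Stand-in for the PTX the compiler would embed in the host binary.
    static const char *KernelPTX = "...";

    int main() {
      cuInit(0);
      CUdevice Dev;
      cuDeviceGet(&Dev, 0);
      CUcontext Ctx;
      cuCtxCreate(&Ctx, 0, Dev);

      // The driver JIT-compiles the PTX for whatever GPU is present.
      CUmodule Mod;
      if (cuModuleLoadData(&Mod, KernelPTX) != CUDA_SUCCESS) {
        std::fprintf(stderr, "PTX JIT compilation failed\n");
        return 1;
      }

      CUfunction Kernel;
      cuModuleGetFunction(&Kernel, Mod, "polly_kernel"); // hypothetical name

      // Launch one block of 32 threads; real code passes kernel arguments
      // through the last pointer parameters.
      cuLaunchKernel(Kernel, 1, 1, 1, 32, 1, 1, 0, nullptr, nullptr, nullptr);
      cuCtxSynchronize();
      cuCtxDestroy(Ctx);
      return 0;
    }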
Justin Holewinski
2012-Apr-04 14:17 UTC
[LLVMdev] GSoC 2012 Proposal: Automatic GPGPU code generation for llvm
On Wed, Apr 4, 2012 at 4:49 AM, Tobias Grosser <tobias at grosser.es> wrote:
> On 04/03/2012 03:13 PM, Hongbin Zheng wrote:
>> Hi Yabin,
>>
>> Instead of compiling the LLVM IR to a PTX asm string in a ScopPass, you
>> can also improve llc/lli or create new tools to support code generation
>> for heterogeneous platforms[1], i.e. generate code for more than one
>> target architecture at the same time. Something like this is not very
>> complicated and has been implemented[2,3] by some people, but is not
>> available in LLVM mainstream. Implementing this could make your GPU
>> project more complete.
>
> I agree with ether that we should ensure as much work as possible is
> done within generic, not Polly-specific, code.

Right, this has the potential to impact more people than the users of
Polly. By moving as much as possible to generic LLVM, that infrastructure
can be leveraged by people doing work outside of the polyhedral model.

> In terms of heterogeneous code generation, the approach Yabin proposed
> seems to work, but we should discuss other approaches. For the moment,
> I believe his proposal is very similar to the model of OpenCL and CUDA:
> he splits the code into host and kernel code. The host code is directly
> compiled to machine code by the existing tools (clang/llc). The kernel
> code is stored as a string and only compiled to platform-specific code
> at execution time.

Depending on your target, that may be the only way. If your target is
OpenCL-compatible accelerators, then your only portable option is to save
the kernel code as OpenCL text and let the driver JIT compile it at
run-time. Any other approach is not guaranteed to be compatible across
platforms or even driver versions.

In this case, the target is the CUDA Driver API, so you're free to pass
along any valid PTX assembly. You still pass the PTX code as a string to
the driver, which JIT compiles it to actual GPU device code at run-time.

> Are there any other approaches that could be taken? What specific
> heterogeneous platform support would be needed? At the moment, it seems
> to me we actually do not need much additional support.

I could see this working without any additional support, if needed. It
seems like this proposal is dealing with LLVM IR -> LLVM IR code
generation, so the only thing that is really needed is a way to split the
IR into multiple separate IRs (one for host, and one for each accelerator
target). This does not really need any supporting infrastructure, as you
could imagine an opt pass processing the input IR, transforming it to the
host IR, and emitting the device IR as a separate module.

Now if you're talking about source-level support for heterogeneous
platforms (e.g. C++ AMP), then you would need to adapt Clang to support
emission of multiple IR modules. Basically, the AST would need to be split
into host and device portions and codegen'd appropriately. I feel that is
far beyond the scope of this proposal, though.

--
Thanks,
Justin Holewinski
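Justin's opt-pass idea can be made concrete with a short sketch. This is
written against present-day LLVM C++ APIs rather than the 2012 ones of the
thread, and the "polly.kernel" function attribute it checks is an assumed
marker, not an existing LLVM convention:

    #include "llvm/IR/Function.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Transforms/Utils/Cloning.h"
    #include <memory>

    using namespace llvm;

    // Clone the module, keep only the marked kernels in the device copy,
    // and strip their bodies from the host copy.
    std::unique_ptr<Module> extractDeviceModule(Module &Host) {
      std::unique_ptr<Module> Device = CloneModule(Host);

      for (Function &F : *Device)
        if (!F.isDeclaration() && !F.hasFnAttribute("polly.kernel"))
          F.deleteBody(); // keep declarations so references still resolve

      for (Function &F : Host)
        if (F.hasFnAttribute("polly.kernel"))
          F.deleteBody(); // the host will call into the runtime instead

      return Device;
    }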
Tobias Grosser
2012-Apr-04 14:35 UTC
[LLVMdev] GSoC 2012 Proposal: Automatic GPGPU code generation for llvm
On 04/04/2012 04:17 PM, Justin Holewinski wrote:
> On Wed, Apr 4, 2012 at 4:49 AM, Tobias Grosser <tobias at grosser.es> wrote:
>> On 04/03/2012 03:13 PM, Hongbin Zheng wrote:
>>> [...]
>>
>> I agree with ether that we should ensure as much work as possible is
>> done within generic, not Polly-specific, code.
>
> Right, this has the potential to impact more people than the users of
> Polly. By moving as much as possible to generic LLVM, that
> infrastructure can be leveraged by people doing work outside of the
> polyhedral model.

To make stuff generic it is often helpful to know the other possible use
cases. I consequently encourage everybody to point out such use cases or
to state which exact functionality they might want to reuse. Otherwise, it
may happen that we focus a little too much on the needs of Polly.

>> In terms of heterogeneous code generation, the approach Yabin proposed
>> seems to work, but we should discuss other approaches. For the moment,
>> I believe his proposal is very similar to the model of OpenCL and CUDA:
>> he splits the code into host and kernel code. The host code is directly
>> compiled to machine code by the existing tools (clang/llc). The kernel
>> code is stored as a string and only compiled to platform-specific code
>> at execution time.
>
> Depending on your target, that may be the only way. If your target is
> OpenCL-compatible accelerators, then your only portable option is to save
> the kernel code as OpenCL text and let the driver JIT compile it at
> run-time. Any other approach is not guaranteed to be compatible across
> platforms or even driver versions.
> In this case, the target is the CUDA Driver API, so you're free to pass
> along any valid PTX assembly. You still pass the PTX code as a string to
> the driver, which JIT compiles it to actual GPU device code at run-time.

I would like to highlight that with the word 'string' I was not referring
to 'OpenCL C code'. I don't think it is a practical approach to recover
OpenCL C code, especially as the LLVM-IR C backend was recently removed. I
meant to describe that the kernel code is stored as a global variable in
the host binary (in some intermediate representation such as LLVM-IR, PTX
or a vendor-specific OpenCL binary) and is loaded at execution time into
the OpenCL or CUDA runtime, where it is compiled down to hardware-specific
machine code.

>> Are there any other approaches that could be taken? What specific
>> heterogeneous platform support would be needed? At the moment, it seems
>> to me we actually do not need much additional support.
>
> I could see this working without any additional support, if needed. It
> seems like this proposal is dealing with LLVM IR -> LLVM IR code
> generation, so the only thing that is really needed is a way to split
> the IR into multiple separate IRs (one for host, and one for each
> accelerator target).
> This does not really need any supporting infrastructure, as you could
> imagine an opt pass processing the input IR, transforming it to the host
> IR, and emitting the device IR as a separate module.

Yes. And instead of saving the two modules in separate files, we can store
the kernel module as a 'string' in the host module and add the necessary
library calls to load it at run time. This will give a smooth user
experience and requires almost no additional infrastructure. (At the
moment this will only work with NVIDIA, but I am confident there will be
OpenCL vendor extensions that allow loading LLVM-IR kernels. AMD's OpenCL
can, e.g., load LLVM-IR, even though it is not officially supported.)

> Now if you're talking about source-level support for heterogeneous
> platforms (e.g. C++ AMP), then you would need to adapt Clang to support
> emission of multiple IR modules. Basically, the AST would need to be
> split into host and device portions and codegen'd appropriately. I feel
> that is far beyond the scope of this proposal, though.

Yes. No source-level transformations, and no targets other than PTX, AMDIL
or LLVM-IR.

Cheers,
Tobi
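A minimal sketch of the embedding Tobias describes, again written against
current LLVM C++ APIs; the global name "__polly_kernel_str" and the runtime
hook "polly_initKernels" are invented for illustration, not an existing
runtime interface:

    #include "llvm/IR/Constants.h"
    #include "llvm/IR/DerivedTypes.h"
    #include "llvm/IR/GlobalVariable.h"
    #include "llvm/IR/Module.h"

    using namespace llvm;

    void embedKernel(Module &Host, StringRef KernelText) {
      LLVMContext &Ctx = Host.getContext();

      // Store the kernel (PTX, LLVM-IR bitcode, ...) as a private
      // constant string in the host module.
      Constant *Str = ConstantDataArray::getString(Ctx, KernelText);
      new GlobalVariable(Host, Str->getType(), /*isConstant=*/true,
                         GlobalValue::PrivateLinkage, Str,
                         "__polly_kernel_str");

      // Declare the hook that would hand the string to the CUDA or
      // OpenCL runtime; a real pass would also insert a call to it,
      // e.g. from a global constructor.
      Host.getOrInsertFunction("polly_initKernels",
                               Type::getVoidTy(Ctx),
                               PointerType::get(Type::getInt8Ty(Ctx), 0));
    }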
Hongbin Zheng
2012-Apr-04 16:48 UTC
[LLVMdev] Fwd: GSoC 2012 Proposal: Automatic GPGPU code generation for llvm
oops, forgot to cc the dev-list

hi tobi,

> Yes. And instead of saving the two modules in separate files, we can
> store the kernel module as a 'string' in the host module and add the
> necessary library calls to load it at run time. This will give a smooth
> user experience and requires almost no additional infrastructure.

We may lose some co-optimization opportunities if we translate the device
functions to a string too early. Instead, we can mark the device functions
with a special calling convention and translate them in lli/llc.

best regards
ether
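A sketch of what ether's marking could look like. LLVM's PTX_Kernel and
PTX_Device calling conventions do exist and survive in the IR; the policy of
using them to drive the split in llc/lli is the assumption here:

    #include "llvm/IR/CallingConv.h"
    #include "llvm/IR/Function.h"

    using namespace llvm;

    // Tag a function as device code so a later llc/lli stage can pick
    // it out instead of stringifying the kernel early.
    void markAsDeviceKernel(Function &F) {
      F.setCallingConv(CallingConv::PTX_Kernel);
    }

    bool isDeviceFunction(const Function &F) {
      return F.getCallingConv() == CallingConv::PTX_Kernel ||
             F.getCallingConv() == CallingConv::PTX_Device;
    }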