Yabin Hu
2012-Apr-03 23:02 UTC
[LLVMdev] GSoC 2012 Proposal: Automatic GPGPU code generation for llvm
Hi Justin,

2012/4/3 Justin Holewinski <justin.holewinski at gmail.com>:

>> *Motivation*
>> With the broad proliferation of GPU computing, it is very important to
>> provide an easy and automatic tool to develop or port applications to
>> the GPU for ordinary developers, especially for domain experts who want
>> to harness the huge computing power of the GPU. Polly has implemented
>> many transformations, such as tiling, auto-vectorization and OpenMP
>> code generation. With the help of LLVM's PTX back-end, I plan to extend
>> Polly with GPGPU code generation.
>
> Very interesting! I'm quite familiar with Muthu's work, and putting that
> into LLVM would be great. If done right, it could apply to any
> heterogeneous system, including AMD GPUs.
>
> As the maintainer and primary developer of the PTX back-end, please feel
> free to contact me with any issues/suggestions you have regarding the
> PTX back-end!

Thanks for your interest and help.

> I'm a bit confused by the wording here. What do you mean by 'LLVM
> sub-function'? I'm assuming you mean extracting the relevant code into a
> separate function, but I would just use the word 'function'.

Yes, it is indeed a function. I used this word following the naming style of the methods in Polly's OpenMP code generation. I will fix this.

> And what do you mean by a run-time library to generate the executable
> program?

In my mind, the runtime library is just a thin wrapper around the CUDA Driver API, to which we can add our own debug information and which insulates users from changes in the CUDA APIs.

> Are you proposing to side-step the LLVM code generator, llc? It seems
> like a reasonable approach would be to write an LLVM pass (or set of
> passes) that takes as input a single IR file, and produces two: (1) the
> GPU kernel/device code, and (2) the non-translatable IR with GPU code
> replaced by appropriate CUDA Driver API calls. Then, both of these can
> pass through the opt/llc tools with the appropriate selection of
> optimization passes and target back-end.
>
> This way, you could fairly easily create a GPGPU compiler by writing a
> simple wrapper around Clang (or better yet, improve Clang to support
> multiple targets simultaneously!)

Ether gave a similar suggestion on this point. Here I copy my reply to him to explain why I chose to embed the transformation pass in my implementation.

Our original motivation for doing it this way is to provide a JIT compiler for our language front-end (a subset of MATLAB/Octave). I have extended lli into a JIT compiler (named gvm) that uses Polly dynamically. However, preliminary results show that the overhead is heavy, so I chose to offload the optimization work from the JIT stage. Putting the LLVM-to-PTX-string pass into Polly also provides a kind of one-touch experience to users. Imagine this user scenario: when a user opens a MATLAB source file or a folder of source files, we can start compiling the sources statically, using Polly and opt to produce optimized LLVM IR. When the user then clicks run or hits the enter key, we only need to JIT the already-optimized LLVM IR, minimizing the dynamic overhead.

Thanks again!

best regards,
Yabin
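P.S. A rough sketch of the wrapper idea, written in C against the real CUDA Driver API (the entry-point name and debug format are hypothetical, and error handling is omitted):

    #include <cuda.h>
    #include <stdio.h>

    /* polly_gpu_malloc: hypothetical runtime wrapper around cuMemAlloc.
     * It forwards to the Driver API, adds our own debug output, and
     * insulates generated code from future Driver API changes. */
    CUresult polly_gpu_malloc(CUdeviceptr *ptr, size_t bytes) {
        CUresult r = cuMemAlloc(ptr, bytes);
        fprintf(stderr, "[polly-gpu] cuMemAlloc(%zu bytes) -> %d\n",
                bytes, (int)r);
        return r;
    }

The generated host IR would then call polly_gpu_malloc instead of cuMemAlloc directly, so a Driver API change only touches the wrapper.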
Yabin Hu
2012-Apr-03 23:41 UTC
[LLVMdev] GSoC 2012 Proposal: Automatic GPGPU code generation for llvm
Hi Justin,

> the non-translatable IR with GPU code replaced by appropriate CUDA
> Driver API calls.

One of the CUDA Driver API functions (cuLaunch) needs a PTX assembly string as its input. So if I want to provide a one-touch solution without introducing any changes to tools outside Polly, I must prepare the PTX string before I can generate the correct non-translatable IR part. Following your suggestion, it could instead be implemented by leaving an input parameter slot for the PTX string in the main method of the non-translatable IR part. Maybe I can implement both versions and let Tobi judge which one is better to integrate into Polly.

best regards,
Yabin
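P.S. For reference, a minimal sketch of the host-side Driver API sequence that consumes the PTX string (using cuLaunchKernel, the newer form of cuLaunch; the kernel name and launch geometry are made-up placeholders, and error checking is omitted):

    #include <cuda.h>

    /* Load a PTX assembly string at run time and launch one kernel.
     * cuModuleLoadData takes the PTX image directly, which is why the
     * PTX string must exist before the host-side part is finalized. */
    void launch_from_ptx(const char *ptx, void **kernel_args) {
        CUdevice dev;
        CUcontext ctx;
        CUmodule mod;
        CUfunction fn;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        cuModuleLoadData(&mod, ptx);             /* PTX string as module image */
        cuModuleGetFunction(&fn, mod, "kernel"); /* placeholder kernel name */

        cuLaunchKernel(fn, 64, 1, 1,             /* grid dimensions  */
                       256, 1, 1,                /* block dimensions */
                       0, 0, kernel_args, 0);    /* shared mem, stream, args */
        cuCtxSynchronize();
    }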
Yabin Hu
2012-Apr-04 02:59 UTC
[LLVMdev] GSoC 2012 Proposal: Automatic GPGPU code generation for llvm
Hi Tobi,

I have revised the proposal. Could you review it and give comments again? Thanks.

*Abstract*
Developing a GPGPU application is very often a time-consuming, complex, error-prone and iterative process. In this project, I propose to build an automatic GPGPU code generation framework for LLVM, based on two successful LLVM (sub-)projects - Polly and the PTX back-end. This can ease the burden of the long learning curve of the various GPU programming models.

*Motivation*
With the broad proliferation of GPU computing, it is very important to provide an easy and automatic tool to develop or port applications to the GPU for ordinary developers, especially for domain experts who want to harness the huge computing power of the GPU. Polly has implemented many transformations, such as tiling, auto-vectorization and OpenMP code generation, and GPGPU code generation has been planned in [1]. With the help of LLVM's PTX back-end, I plan to extend Polly with GPGPU code generation.

*Project Detail*
There are several successful projects on source-to-source automatic GPU code transformation. In this project, I will follow the method proposed by Muthu Manikandan Baskaran et al. in [2].

Since automatic GPGPU code generation is quite a complex problem, we target two specific kinds of test cases. The first kind is comprised of pure parallel loops, like the following code:

    parfor(int i=0 to M)
      parfor(int j=0 to N)
        LoopBody(i, j);

In the second kind, all loops are parallel except the inner-most one, like this:

    parfor(int i=0 to M)
      parfor(int j=0 to N)
        non-parfor(int k=0 to K)
          LoopBody(i, j, k);

The LoopBody part is limited to instructions or (intrinsic) function calls which can be handled by LLVM's PTX back-end.

The workflow of our code generator is as follows. We first use Polly's jscop file importer to obtain the desired 4-level parallel tiled code. We then extract the loop body (or the inner non-parallel loops) into an LLVM function, tagging it with the PTX_Kernel or PTX_Device calling convention. Next, we use the PTX back-end to translate the PTX_Kernel and PTX_Device functions into strings of the corresponding PTX code. After that, we transform the non-translatable part of the LLVM IR, inserting GPU runtime library calls. The GPU execution configuration is acquired from external user-specified jscop files, support for which has already been implemented in Polly. Finally, we provide a runtime library to generate the executable program, or run the optimized LLVM IR with a JIT compiler like lli.

There are two key challenges in this project.

1. How to insert the synchronization code automatically. This is very important for preserving the original semantics; we must correctly detect where synchronization is needed.

2. How to generate the memory copy operations between host and device automatically. We must transfer the input data to the GPU and copy the results back. Fortunately, Polly has implemented a very expressive way to describe memory accesses. We will follow the taxonomy proposed by Chris Gregg et al. in [3].

*Timeline*
- May 21 ~ June 11: Preliminary GPGPU code generation.
  In this stage, implement GPU code generation for 1D and 2D parallel-loop test cases that need no host-to-device memory copies as input, and verify that our method works.
- June 12 ~ June 24: Automatic memory copy insertion.
  In this stage, insert memory copy operations for all array accesses correctly, according to the Read/Write properties provided by Polly.
- June 25 ~ July 8: Code generation for parallel loops with a non-parallel inner-most loop.
  In this stage, implement GPGPU code generation for the classical matrix multiplication test case (a hedged CUDA-C sketch of the intended GPU code appears after the references):

    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++)
          C[i][j] = C[i][j] + A[i][k] * B[k][j];
      }
    }

- July 9 ~ July 15: Midterm evaluation and writing documents.
- July 16 ~ July 22: Automatic synchronization insertion.
  In this stage, implement Muthu's method, introduced in Section 4.3 of [2], to insert barrier synchronizations that preserve semantic equivalence.
- July 23 ~ August 5: Test on the PolyBench benchmarks and report results.
- August 6 ~ August 12: Summarize and complete the final documents.

*Project Experience*
I have participated in several projects related to binary translation (and optimization) and run-time systems, and I implemented a front-end for numerical computing languages like Octave/MATLAB, following the style of Clang. Recently, I have worked closely with the Polly team, contributing patches [4] and investigating many details of polyhedral transformation.

*References*
1. Tobias Grosser, Ragesh A. *Polly - First Successful Optimizations - How to proceed?* LLVM Developer Meeting 2011.
2. Muthu Manikandan Baskaran, J. Ramanujam and P. Sadayappan. *Automatic C-to-CUDA Code Generation for Affine Programs.* International Conference on Compiler Construction (CC) 2010.
3. Chris Gregg and Kim Hazelwood. *Where is the Data? Why You Cannot Debate GPU vs. CPU Performance Without the Answer.* International Symposium on Performance Analysis of Systems and Software (ISPASS) 2011.
4. http://llvm.org/viewvc/llvm-project?view=rev&revision=153319
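To illustrate the matrix-multiplication milestone referenced in the timeline above, here is a hedged CUDA-C rendering of roughly what the generated code should be equivalent to: the two parallel loops map onto the thread grid, while the non-parallel k loop stays sequential inside the kernel. The flattened array layout and the kernel name are illustrative assumptions; Polly would actually emit an LLVM function with the PTX_Kernel calling convention rather than CUDA source.

    /* Hypothetical CUDA-C equivalent of the planned transformation. */
    __global__ void matmul_kernel(const float *A, const float *B,
                                  float *C, int N) {
        int i = blockIdx.y * blockDim.y + threadIdx.y;  /* parallel loop i */
        int j = blockIdx.x * blockDim.x + threadIdx.x;  /* parallel loop j */
        if (i < N && j < N) {
            float acc = C[i * N + j];
            for (int k = 0; k < N; ++k)  /* non-parallel inner-most loop */
                acc += A[i * N + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }

The host side would surround the launch with cuMemcpyHtoD calls for A, B and C and a cuMemcpyDtoH call for C, matching the memory-copy insertion milestone.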