After reading about Intel's 'Shevlin Park' project to implement C++AMP in llvm/clang, and failing to find any code for it, I decided to try to implement something similar. I did it as an excuse to explore and hack on llvm/clang, which I hadn't done before, but it's now at the point where it will run the simplest matrix multiplication sample from MSDN, so I thought I might as well share it.

The source is in:
https://github.com/corngood/llvm.git
https://github.com/corngood/clang.git
https://github.com/corngood/compiler-rt.git [unchanged]
https://github.com/corngood/amp.git [simple test project]

It's fairly hacky, and very fragile, so don't expect anything that isn't used in the sample to work. I also haven't tested it on large datasets, and there are some things that definitely need fixing before I'd expect good performance (e.g. workgroup size). It currently works only on NVIDIA GPUs, and has only been tested on my shitty old 9600GT on amd64 linux with the stable binary drivers.

The compilation process currently works like this:

.cpp -> [clang++ -fc++-amp] -> .ll
  - compile non-amp code

.cpp -> [clang++ -fc++-amp -famp-is-kernel] -> .amp.ll
  - compile amp kernels only

.amp.ll -> [opt -amp-to-opencl] -> .nvvm.ll
  - create kernel wrapper to deal with buffer/const inputs
  - add nvvm annotations

.nvvm.ll -> [llc -march=nvptx] -> .ptx
  - compile kernels to NVPTX (unchanged)

.ll + .ptx -> [opt -amp-create-stubs .ptx] -> .opt.ll
  - embed ptx as array data
  - create functions to get kernel info, load inputs, etc

.opt.ll -> [llc] -> .o
  - unchanged

The clang steps only differ in codegen, so eventually they should be combined into one clang call. NVPTX is meant to be replaced with SPIR at some point, to make it portable, which is why I didn't bother with text kernel generation.

I won't go into implementation details, but if anyone is interested, or working on something similar, feel free to get in touch.

Thanks,
Dave McFarland
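For anyone who wants to script the six stages listed in the message, here is a minimal Python sketch that only builds the command lines and does not run them. The AMP-specific flags (`-fc++-amp`, `-famp-is-kernel`, `-amp-to-opencl`, `-amp-create-stubs`) are taken from the message. Everything else is an assumption: the helper name `amp_pipeline`, the standard `-S`, `-emit-llvm`, `-o` and `-filetype=obj` options added to make each stage produce the file type shown, and especially the way the `.ptx` path is passed to `-amp-create-stubs`, which may well differ in the actual fork.

```python
import shlex
from pathlib import Path
from typing import List


def amp_pipeline(source: str) -> List[List[str]]:
    """Return one argv list per stage, in order, without executing anything."""
    stem = str(Path(source).with_suffix(""))  # "matmul.cpp" -> "matmul"
    host_ll = stem + ".ll"
    amp_ll = stem + ".amp.ll"
    nvvm_ll = stem + ".nvvm.ll"
    ptx = stem + ".ptx"
    opt_ll = stem + ".opt.ll"
    obj = stem + ".o"
    return [
        # 1. Host pass: compile the non-AMP code to LLVM IR.
        ["clang++", "-fc++-amp", "-S", "-emit-llvm", source, "-o", host_ll],
        # 2. Device pass: compile only the AMP kernels to LLVM IR.
        ["clang++", "-fc++-amp", "-famp-is-kernel", "-S", "-emit-llvm",
         source, "-o", amp_ll],
        # 3. Wrap kernels and add NVVM annotations.
        ["opt", "-amp-to-opencl", "-S", amp_ll, "-o", nvvm_ll],
        # 4. Lower the kernels to PTX with the stock NVPTX backend.
        ["llc", "-march=nvptx", nvvm_ll, "-o", ptx],
        # 5. Embed the PTX and create host stubs (argument form is a guess).
        ["opt", "-amp-create-stubs", ptx, "-S", host_ll, "-o", opt_ll],
        # 6. Compile the combined host module to an object file.
        ["llc", "-filetype=obj", opt_ll, "-o", obj],
    ]


if __name__ == "__main__":
    for cmd in amp_pipeline("matmul.cpp"):
        print(shlex.join(cmd))
```

Keeping the stages as data makes the two clang invocations easy to spot, and would make it simple to collapse them into one if the codegen differences were ever merged.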
----- Original Message -----
> From: corngood at gmail.com
> To: llvmdev at cs.uiuc.edu
> Sent: Saturday, April 13, 2013 9:13:57 PM
> Subject: [LLVMdev] C++AMP -> OpenCL (NVPTX) prototype

Dave,

[I've copied the cfe-dev list as well.]

Thanks for sharing this! I think this sounds very interesting. I don't know much about AMP, but I do have users who are also interested in accelerator targeting, and I'd like you to share your thoughts on:

1. Does your implementation share common functionality with the 'captured statement' work that Intel is currently doing (in order to support Cilk, OpenMP, etc.)? If you're not aware of it, see:
http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20130408/077615.html
-- This should end up in trunk soon. I ask because if the current captured statement patches would almost, but not quite, work for you, then it would be interesting to understand why.

2. What will be necessary to eliminate the two-clang-invocations problem. If we ever grow support for embedded accelerator targeting (through AMP, OpenACC, OpenMP 4+, etc.), it sounds like this will be a common requirement, and if I had to guess, there is common interest in putting the necessary infrastructure in place.

-Hal
On April 14, 2013 09:42:28 AM Hal Finkel wrote:
> Dave,
>
> [I've copied the cfe-dev list as well.]
>
> Thanks for sharing this! I think this sounds very interesting. I don't know
> much about AMP, but I do have users who are also interested in accelerator
> targeting, and I'd like you to share your thoughts on:
>
> 1. Does your implementation share common functionality with the 'captured
> statement' work that Intel is currently doing (in order to support Cilk,
> OpenMP, etc.)? If you're not aware of it, see:
> http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20130408/077615.html
> -- This should end up in trunk soon. I ask because if the current
> captured statement patches would almost, but not quite, work for you, then
> it would be interesting to understand why.

Kernels in AMP are represented by a lambda, so I haven't had to do anything special to capture variables. I do some work in the opt passes to marshal certain types (buffer references so far; also textures, etc. in the future), so maybe there's some overlap there. Thanks for the link, I'll have to read more about it.

> 2. What will be necessary to eliminate the two-clang-invocations problem.
> If we ever grow support for embedded accelerator targeting (through AMP,
> OpenACC, OpenMP 4+, etc.), it sounds like this will be a common
> requirement, and if I had to guess, there is common interest in putting the
> necessary infrastructure in place.

The only reason I have two clang invocations right now is how I dealt with address spaces. In the Shevlin Park presentation, they mentioned doing analysis and assigning address spaces after codegen, but I just assign them using __attribute__((address_space(N))) for now, and zero them out for CPU codegen with a TargetOpt. It sort of piggybacks on the OpenCL -> NVPTX/SPIR/AMD/etc address space abstraction. The other differences are similar to how CodeGenOpts.CUDAIsDevice works.

Unfortunately it won't be sufficient for a full implementation of AMP, which doesn't specify (to my knowledge) any address-space declaration on pointer types, but still allows pointers into buffers in various address spaces.

> -Hal

To be honest, I'm not crazy about the AMP specification. I just like the idea of compiling a heterogeneous module for host/device code, which can be easily integrated into existing C++ applications. I'd be happy for it to drop the MS-specific syntax like properties, use C++ attributes wherever possible instead of keywords, and have explicit address spaces like CUDA/OpenCL.

I think the big problem is going to be making it robustly handle two very different targets in one pass. Most obviously, supporting different bitness for host/device. My testing was all on 64/32 bit, but all other combinations occur in practice.

- Dave
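To make the host/device bitness concern concrete, here is a small Python sketch of how the same kernel argument block changes size with pointer width. It is an illustration only, not how the prototype marshals inputs: the function name `pack_kernel_args`, the argument layout (one buffer pointer followed by a 32-bit element count), and the packed little-endian encoding with no padding are all assumptions.

```python
import struct


def pack_kernel_args(buffer_addr: int, element_count: int,
                     pointer_bits: int) -> bytes:
    """Pack (buffer pointer, uint32 count) for a target with the given
    pointer width, little-endian and without padding (assumed layout)."""
    if pointer_bits == 64:
        fmt = "<QI"  # 8-byte pointer, 4-byte count
    elif pointer_bits == 32:
        fmt = "<II"  # 4-byte pointer, 4-byte count
    else:
        raise ValueError("pointer_bits must be 32 or 64")
    return struct.pack(fmt, buffer_addr, element_count)


if __name__ == "__main__":
    host_view = pack_kernel_args(0x1000, 16, 64)
    device_view = pack_kernel_args(0x1000, 16, 32)
    # The two layouts differ in size, so one struct definition cannot be
    # shared blindly between host and device code.
    print(len(host_view), len(device_view))
```

An address that does not fit in 32 bits cannot be packed for a 32-bit device at all (`struct.pack` raises `struct.error`), which is the kind of mismatch a single-pass host/device compile would have to account for.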