Chris Lattner
2012-May-11 16:20 UTC
[LLVMdev] [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation
Hi guys,

Just catching up on an interesting thread :)

On May 7, 2012, at 1:15 AM, Tobias Grosser wrote:
> I believe this can be a way worth going,
> but I doubt now is the right moment for it. I don't share your opinion
> that it is easy to move LLVM-IR in this direction, but I rather believe
> that this is an engineering project that will take several months of
> full time work.

From a philosophical perspective, there can be times when it makes sense to do something short-term to gain experience, but we try not to keep that sort of thing in for a whole release cycle, because then we have to be compatible with it forever.

Also, I know you're not saying it, but the "I don't want to do the right thing, because it is too much work" sentiment grates against me: that's a perfect case for keeping a patch local and out of the llvm.org tree. Again, I know that this is not what you're trying to get at.

David wrote:
> Again, we have many of the changes to make this possible. I hope to
> send them for review as we upgrade to 3.1.

A vague promise to release some code that may or may not be useful is also not particularly useful.

On May 8, 2012, at 11:49 AM, Tobias Grosser wrote:
> I want clang to automatically create executables that use CUDA/OpenCL to
> offload core computations (from plain C code). This should be
> implemented in an external LLVM-IR optimization pass.
>
>   clang -Xclang -load -Xclang CUDAGenerator.so file.c -O3 -mllvm -offload-cuda
>
> The very same should work for Pure, dragonegg and basically any compiler
> based on LLVM. So I do not want to change clang at all (except of
> possibly linking to -lcuda).

Ok, that *is* an interesting use case. It would be great for LLVM to support this kind of thing. We're clearly not set up for it out of the box right now.

On May 8, 2012, at 2:08 AM, Tobias Grosser wrote:
> In terms of the complexity: the only alternative proposal I have heard
> of was making LLVM-IR multi-module aware or adding multi-module support
> to all LLVM-IR tools. Both of these changes are way more complex than
> the codegen intrinsic. Actually, they are so complex that I doubt that
> they can be implemented any time soon. What is the simpler approach you
> are talking about?

I also don't like the intrinsic, but not because of security ;-). For me, it is because embedding arbitrary blobs of IR in an *instruction* doesn't make sense. The position of the instruction in the parent function doesn't necessarily have anything to do with the code attached; the intrinsic can be duplicated, deleted, moved around, etc. It is also poorly specified what is allowed and legal.

Unlike the related-but-different problem of "multi-versioning", it also doesn't make sense for PTX code to be functions in the same module as X86 IR functions. If your desire were for a module to have an SSE2, SSE3, and SSE4 version of the same function, then it *would* make sense for them to be in the same module... because there is linkage between them, and a runtime dispatcher. We don't have the infrastructure yet for per-function CPU flags, but this is something that we will almost certainly grow at some point (just need a clean design). This doesn't help you, though. :)

The design that makes sense to me for this is the multi-module approach. The PTX and X86 code *should* be in different LLVM Modules from each other. I agree that this makes a "vectorize host code to the GPU" optimization pass different from other existing passes, but I don't think that's a bad thing.
Realistically, the driver compiler that this is embedded into (clang, dragonegg, or whatever) will need to know about both targets to some extent, to handle command line options for selecting the PTX/GPU version, deciding where and how to output both chunks of code in the output file, etc. Given that the driver has to have *some* knowledge of this anyway, it doesn't seem problematic for the second module to be passed into the pass constructor. Instead of something like:

  PM.add(new OffloadToCudaPass())

You end up doing:

  Module *CudaModule = new Module(...)
  PM.add(new OffloadToCudaPass(CudaModule))

This also means that the compiler driver is responsible for deciding what to do with the module after it is formed (and of course, it may be empty if nothing is offloaded). Based on the compiler it's embedded into, it may immediately JIT to PTX and upload to a GPU, it may write the IR out to a file, it may run the PTX code generator and output the PTX to another section of the executable, or whatever.

I do agree that this makes it more awkward to work with "opt" on the command line, and that clang plugins are ideally suited for this, but opt is already suboptimal for a lot of things (e.g. anything that requires target info), and we should improve clang plugins, not work around their limitations, IMO.

What do you think?

-Chris
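For concreteness, here is a minimal sketch of what such a pass could look like against current LLVM headers (legacy pass manager). The OffloadToCudaPass name comes from the snippet above; its members and the extraction logic are hypothetical:

  #include "llvm/Pass.h"
  #include "llvm/IR/Module.h"

  using namespace llvm;

  // Hypothetical offload pass: extracts offloadable kernels from the host
  // module it runs on into a second, driver-owned device module.
  struct OffloadToCudaPass : public ModulePass {
    static char ID;
    Module *CudaModule; // owned by the compiler driver, not by the pass

    explicit OffloadToCudaPass(Module *M) : ModulePass(ID), CudaModule(M) {}

    bool runOnModule(Module &Host) override {
      // ... move/clone kernel functions from Host into *CudaModule and
      // replace them in Host with calls into the offload runtime ...
      return true; // Host was modified
    }
  };
  char OffloadToCudaPass::ID = 0;

  // Driver side: the driver creates and owns the device module, and decides
  // what to do with it once the pipeline has run (JIT it, emit PTX, write
  // out the IR, ...):
  //   Module *CudaModule = new Module("cuda.kernels", Context);
  //   PM.add(new OffloadToCudaPass(CudaModule));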
Tobias Grosser
2012-May-15 08:22 UTC
[LLVMdev] Re: [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation
On 05/11/2012 06:20 PM, Chris Lattner wrote:
> Hi guys,
>
> On May 7, 2012, at 1:15 AM, Tobias Grosser wrote:
>> I believe this can be a way worth going,
>> but I doubt now is the right moment for it. I don't share your opinion
>> that it is easy to move LLVM-IR in this direction, but I rather believe
>> that this is an engineering project that will take several months of
>> full time work.
>
> From a philosophical perspective, there can be times when it makes sense to do something short-term to gain experience, but we try not to keep that sort of thing in for a whole release cycle, because then we have to be compatible with it forever.
>
> Also, I know you're not saying it, but the "I don't want to do the right thing, because it is too much work" sentiment grates against me: that's a perfect case for keeping a patch local and out of the llvm.org tree. Again, I know that this is not what you're trying to get at.

I was afraid it would sound like this. I previously explained why I disagree. Other people disagreed with my disagreement. ;-) This discussion definitely helps to understand the different solutions. The multi-module approach seems interesting, even though I am not yet convinced it is a better solution.

> On May 8, 2012, at 2:08 AM, Tobias Grosser wrote:
>> In terms of the complexity: the only alternative proposal I have heard
>> of was making LLVM-IR multi-module aware or adding multi-module support
>> to all LLVM-IR tools. Both of these changes are way more complex than
>> the codegen intrinsic. Actually, they are so complex that I doubt that
>> they can be implemented any time soon. What is the simpler approach you
>> are talking about?
>
> I also don't like the intrinsic, but not because of security ;-). For me, it is because embedding arbitrary blobs of IR in an *instruction* doesn't make sense. The position of the instruction in the parent function doesn't necessarily have anything to do with the code attached; the intrinsic can be duplicated, deleted, moved around, etc. It is also poorly specified what is allowed and legal.

The blobs are embedded as global unnamed constant strings; the intrinsic references them where needed. This directly models how a simple OpenCL or CUDA program would be written: the kernel code is stored as PTX in some globals, and functions like 'cuModuleLoadDataEx' are used to load and compile such kernels at runtime.

The position of the instruction itself is defined by the context in which it is used. I probably did not make it clear beforehand, but we plan to replace a computation kernel by a heterogeneous mix of host LLVM-IR and kernel calls. Something like this:

  for (i
    for (j
      if (..)
        schedule_cuda(llvm.codegen("kernel", "ptx32"))
      else if (..)
        schedule_cuda(llvm.codegen("kernel", "ptx64"))
      else
        // Fallback CPU code
        for (...

      if (..)
        schedule_cuda(llvm.codegen("kernel", "ptx32"))
      else if (..)
        schedule_cuda(llvm.codegen("kernel", "ptx64"))
      else
        // Fallback CPU code
  }

This means we have host code that performs the calculations that are not offloaded and that schedules the different kernel executions. The host code (or a runtime library) also takes care of deciding which GPU type we target, or whether fallback CPU code is needed. In case we execute on a GPU, the host code passes the PTX string to the CUDA runtime, the CUDA runtime JIT-compiles it, and the host code caches the result for future use. This means the llvm.codegen() intrinsic is used directly by the host code, which compiles and schedules the kernels.
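As a concrete illustration of that host-side path, here is a minimal sketch using the CUDA driver API (cuModuleLoadDataEx, cuModuleGetFunction, cuLaunchKernel are the real entry points). The kernel_ptx global, the "kernel" entry name, and schedule_cuda itself are hypothetical stand-ins for what the offload pass would emit; context setup and error handling are elided:

  #include <cuda.h>
  #include <stddef.h>

  extern const char kernel_ptx[]; // hypothetical: PTX blob stored in a global

  // JIT-compile the PTX on first use and cache the function handle,
  // mirroring the "compile once, cache for future use" scheme above.
  static CUfunction get_kernel(void) {
    static CUmodule Mod;
    static CUfunction Fn;
    if (!Fn) {
      cuModuleLoadDataEx(&Mod, kernel_ptx, 0, NULL, NULL); // runtime JIT
      cuModuleGetFunction(&Fn, Mod, "kernel");
    }
    return Fn;
  }

  // Hypothetical scheduling helper called from the generated host code.
  void schedule_cuda(void *Args[], unsigned Grid, unsigned Block) {
    cuLaunchKernel(get_kernel(), Grid, 1, 1, Block, 1, 1,
                   0 /* sharedMemBytes */, NULL /* stream */, Args, NULL);
  }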
The intrinsic can therefore only be moved around together with the corresponding host code, and moving and modifying it with the host code seems to make sense. If, e.g., a code path is proven dead, we would automatically dead-code-eliminate the kernel code together with the surrounding host code. The same holds for function versioning: if the host code is duplicated, we want to also duplicate the intrinsic, such that the kernel code is referenced from two positions. (The kernel code is still only stored once, but it is referenced from two places.)

In general, what is allowed and legal follows the definition of an LLVM-IR function call (which can be marked readonly). We were aiming to not require any special handling of the intrinsic here. What do you think is not specified precisely? Maybe it can be fixed.

> Unlike the related-but-different problem of "multi-versioning", it also doesn't make sense for PTX code to be functions in the same module as X86 IR functions. If your desire were for a module to have an SSE2, SSE3, and SSE4 version of the same function, then it *would* make sense for them to be in the same module... because there is linkage between them, and a runtime dispatcher. We don't have the infrastructure yet for per-function CPU flags, but this is something that we will almost certainly grow at some point (just need a clean design). This doesn't help you, though. :)

I was also reasoning about combining this with multi-versioning, but I agree multi-versioning is related-but-different.

> The design that makes sense to me for this is the multi-module approach. The PTX and X86 code *should* be in different LLVM Modules from each other. I agree that this makes a "vectorize host code to the GPU" optimization pass different from other existing passes, but I don't think that's a bad thing. Realistically, the driver compiler that this is embedded into (clang, dragonegg, or whatever) will need to know about both targets to some extent, to handle command line options for selecting the PTX/GPU version, deciding where and how to output both chunks of code in the output file, etc.
>
> Given that the driver has to have *some* knowledge of this anyway, it doesn't seem problematic for the second module to be passed into the pass constructor. Instead of something like:
>
>   PM.add(new OffloadToCudaPass())
>
> You end up doing:
>
>   Module *CudaModule = new Module(...)
>   PM.add(new OffloadToCudaPass(CudaModule))
>
> This also means that the compiler driver is responsible for deciding what to do with the module after it is formed (and of course, it may be empty if nothing is offloaded). Based on the compiler it's embedded into, it may immediately JIT to PTX and upload to a GPU, it may write the IR out to a file, it may run the PTX code generator and output the PTX to another section of the executable, or whatever. I do agree that this makes it more awkward to work with "opt" on the command line, and that clang plugins are ideally suited for this, but opt is already suboptimal for a lot of things (e.g. anything that requires target info), and we should improve clang plugins, not work around their limitations, IMO.
>
> What do you think?

It seems we agree that host and kernel code should be in different modules. That is nice. Instead of embedding the kernel modules directly into the host module, you propose to pass empty kernel modules to the constructor of the CUDA offload pass and to extract the CUDA kernels into those modules. Your approach removes the need to add file I/O to the optimization pass. This is a very positive point.
I am still unsure about the following questions:

o Extracting multiple kernels

  A single computation normally schedules several kernels, both to
  specialize for different hardware and to calculate different parts of
  the problem. How would you model this? Returning a list of modules?

o How to reference the kernel modules from host code

  We need a way to reference the kernel modules from the host code. Your
  proposal does not specify anything here. When the kernel code is
  directly embedded in the host IR, the function calls to the CUDA/OpenCL
  runtime can directly reference it (possibly through the llvm.codegen()
  intrinsic). Are you suggesting some intrinsics to reference the
  externally stored modules?

o How much logic to put into the driver

  Our current idea was to put the entire logic of loading, compiling and
  running kernels into the host code. This enables us to change this code
  independently of the driver and to embed complex logic here. The only
  driver change would be to ask clang to add -lcuda or -lopencl. This
  could be done with a clang plugin.

  It seems you want to put more logic into the driver. Where would you,
  e.g., implement code that caches the kernels and compiles them just in
  time (and only if they are actually executed)? Would this be part of
  the driver? How would you link this with fallback host code? If
  different optimizer projects implement different strategies here, are
  you proposing to commit all those to the clang driver, or to extend
  clang plugins to handle this? And what about non-clang frontends like
  dragonegg: would the driver changes need to be ported to dragonegg,
  too?

Thanks again for your ideas
Tobi