dag at cray.com
2012-Apr-30 19:55 UTC
[LLVMdev] [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation
Tobias Grosser <tobias at grosser.es> writes:

> To write optimizations that yield embedded GPU code, we also looked into
> three other approaches:
>
> 1. Directly create embedded target code (e.g. PTX)
>
> This would mean the optimization pass extracts device code internally
> and directly generates the relevant target code. This approach would
> require our generic optimization pass to be directly linked with the
> specific target back end. This is an ugly layering violation and, in
> addition, it causes major trouble in case the new optimization should
> be dynamically loaded.

IMHO it's a bit unrealistic to have a target-independent optimization
layer. Almost all optimization wants to know target details at some
point. I think we can and probably should support that. We can allow
passes to gracefully fall back in the cases where target information is
not available.

> 2. Extend the LLVM-IR files to support heterogeneous modules
>
> This would mean we extend LLVM-IR such that IR for different targets
> can be stored within a single IR file. This approach could be
> integrated nicely into the LLVM code generation flow and would yield
> readable LLVM-IR even for the device code. However, it adds another
> level of complexity to the LLVM-IR files and requires massive changes
> not only in the LLVM code base, but also in compilers built on top of
> LLVM-IR.

I don't think the code base changes are all that bad. We have a number
of them to support generating code one function at a time rather than a
whole module together. They've been sitting around waiting for us to
send them upstream. It would be an easy matter to simply annotate each
function with its target. We don't currently do that because we never
write out such IR files, but it seems like a simple problem to solve to
me.

> 3. Generate two independent LLVM-IR files and pass them around together
>
> The host and device LLVM-IR modules could be kept in separate files.
> This has the benefit of being user readable and not adding additional
> complexity to the LLVM-IR files themselves. However, separate files do
> not provide information about how those files are related. Which files
> are kernel files, how/where do they need to be loaded, ...? This
> information could probably be put into metadata or could be hard coded
> into the generic compiler infrastructure, but this would require
> significant additional code.

I don't think metadata would work because it would not satisfy the "no
semantic effects" requirement. We couldn't just drop the metadata and
expect things to work.

> Another weakness of this approach is that the entire LLVM optimization
> chain is currently built under the assumption that a single file/module
> is passed around. This is most obvious with the 'opt | llc' idiom, but
> in general every tool that currently exists would need to be adapted to
> handle multiple files and would possibly even need semantic knowledge
> about how to connect/use them together. Just running clang or dragonegg
> with -load GPGPUOptimizer.so would not be possible.

Again, we have many of the changes to make this possible. I hope to
send them for review as we upgrade to 3.1.

> All of the previous approaches require significant changes all over the
> code base and would cause trouble with loadable optimization passes.
> The intrinsic-based approach seems to address most of the previous
> problems.

I'm pretty uncomfortable with the proposed intrinsic. It feels
tacked-on and not in the LLVM spirit. We should be able to extend the
IR to support multiple targets. We're going to need this kind of
support for much more than GPUs in the future. Heterogeneous computing
is here to stay.

                                -Dave
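[Editor's note: for readers skimming the archive, the intrinsic being debated works roughly as follows: the device kernel travels inside the host module as a plain LLVM-IR string, and a call to llvm.codegen asks the back end named by the second operand to lower that string at host-code-generation time, yielding (for NVIDIA) a PTX string. The sketch below paraphrases the shape of such a call; the exact signature, operand types, and array sizes are illustrative, not the RFC's final form.]

```llvm
; Hedged sketch only: signature and operands paraphrased, not authoritative.
; The kernel module is carried as LLVM-IR text inside the host module.
@kernel_ir = private constant [64 x i8] c"define void @kern(float* %p) { ... }\00"
@arch      = private constant [4 x i8] c"ptx\00"

declare i8* @llvm.codegen(i8* %ir, i8* %arch)

define i8* @get_ptx() {
  ; During host code generation this call would be replaced by a pointer
  ; to the PTX assembly string produced for the embedded module.
  %ptx = call i8* @llvm.codegen(
             i8* getelementptr ([64 x i8]* @kernel_ir, i32 0, i32 0),
             i8* getelementptr ([4 x i8]* @arch, i32 0, i32 0))
  ret i8* %ptx
}
```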
dag at cray.com
2012-Apr-30 20:03 UTC
[LLVMdev] [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation
<dag at cray.com> writes:

> Tobias Grosser <tobias at grosser.es> writes:
>
>> To write optimizations that yield embedded GPU code, we also looked into
>> three other approaches:
>>
>> 1. Directly create embedded target code (e.g. PTX)
>>
>> This would mean the optimization pass extracts device code internally
>> and directly generates the relevant target code. This approach would
>> require our generic optimization pass to be directly linked with the
>> specific target back end. This is an ugly layering violation and, in
>> addition, it causes major trouble in case the new optimization should
>> be dynamically loaded.
>
> IMHO it's a bit unrealistic to have a target-independent optimization
> layer. Almost all optimization wants to know target details at some
> point. I think we can and probably should support that. We can allow
> passes to gracefully fall back in the cases where target information is
> not available.

I think I misread your intent here. It is indeed a very bad layering
violation to have opt generate code. In the response above I am talking
about making target characteristics available to opt passes if they are
available. I think the latter is important to get good performance.

                                -Dave
Justin Holewinski
2012-May-01 04:32 UTC
[LLVMdev] [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation
On Mon, Apr 30, 2012 at 12:55 PM, <dag at cray.com> wrote:

> Tobias Grosser <tobias at grosser.es> writes:
>
> > To write optimizations that yield embedded GPU code, we also looked into
> > three other approaches:
> >
> > 1. Directly create embedded target code (e.g. PTX)
> >
> > This would mean the optimization pass extracts device code internally
> > and directly generates the relevant target code. This approach would
> > require our generic optimization pass to be directly linked with the
> > specific target back end. This is an ugly layering violation and, in
> > addition, it causes major trouble in case the new optimization should
> > be dynamically loaded.
>
> IMHO it's a bit unrealistic to have a target-independent optimization
> layer. Almost all optimization wants to know target details at some
> point. I think we can and probably should support that. We can allow
> passes to gracefully fall back in the cases where target information is
> not available.
>
> > 2. Extend the LLVM-IR files to support heterogeneous modules
> >
> > This would mean we extend LLVM-IR such that IR for different targets
> > can be stored within a single IR file. This approach could be
> > integrated nicely into the LLVM code generation flow and would yield
> > readable LLVM-IR even for the device code. However, it adds another
> > level of complexity to the LLVM-IR files and requires massive changes
> > not only in the LLVM code base, but also in compilers built on top of
> > LLVM-IR.
>
> I don't think the code base changes are all that bad. We have a number
> of them to support generating code one function at a time rather than a
> whole module together. They've been sitting around waiting for us to
> send them upstream. It would be an easy matter to simply annotate each
> function with its target. We don't currently do that because we never
> write out such IR files, but it seems like a simple problem to solve to
> me.

If such changes are almost ready to be upstreamed, then great! It just
seems like a fairly non-trivial task to actually implement
function-level target selection, especially when you consider function
call semantics, taking the address of a function, etc. If you have a
global variable, what target "sees" it? Does it need to be annotated
along with the function? Can functions from two different targets share
this pointer? At first glance, there seem to be many non-trivial issues
that are heavily dependent on the nature of the target.

For Yabin's use-case, the X86 portions need to be compiled to assembly,
or even an object file, while the PTX portions need to be lowered to an
assembly string and embedded in the X86 source (or written to disk
somewhere). If you're targeting Cell, in contrast, you'd want to
compile both down to object files.

Don't get me wrong, I think this is something we need to do and the
llvm.codegen intrinsic is a band-aid solution, but I don't see this as
a simple problem.

> > 3. Generate two independent LLVM-IR files and pass them around together
> >
> > The host and device LLVM-IR modules could be kept in separate files.
> > This has the benefit of being user readable and not adding additional
> > complexity to the LLVM-IR files themselves. However, separate files do
> > not provide information about how those files are related. Which files
> > are kernel files, how/where do they need to be loaded, ...? This
> > information could probably be put into metadata or could be hard coded
> > into the generic compiler infrastructure, but this would require
> > significant additional code.
>
> I don't think metadata would work because it would not satisfy the "no
> semantic effects" requirement. We couldn't just drop the metadata and
> expect things to work.
>
> > Another weakness of this approach is that the entire LLVM optimization
> > chain is currently built under the assumption that a single file/module
> > is passed around. This is most obvious with the 'opt | llc' idiom, but
> > in general every tool that currently exists would need to be adapted to
> > handle multiple files and would possibly even need semantic knowledge
> > about how to connect/use them together. Just running clang or dragonegg
> > with -load GPGPUOptimizer.so would not be possible.
>
> Again, we have many of the changes to make this possible. I hope to
> send them for review as we upgrade to 3.1.
>
> > All of the previous approaches require significant changes all over the
> > code base and would cause trouble with loadable optimization passes.
> > The intrinsic-based approach seems to address most of the previous
> > problems.
>
> I'm pretty uncomfortable with the proposed intrinsic. It feels
> tacked-on and not in the LLVM spirit. We should be able to extend the
> IR to support multiple targets. We're going to need this kind of
> support for much more than GPUs in the future. Heterogeneous computing
> is here to stay.

For me, the bigger question is: do we extend the IR to support multiple
targets, or do we keep the one-target-per-module philosophy and derive
some other way of representing how the modules fit together? I can see
pros and cons for both approaches.

What if instead of per-function annotations, we implement something
like module file sections? You could organize a module file into
logical sections based on target architecture. I'm just throwing that
out there.

> -Dave

--
Thanks,

Justin Holewinski
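[Editor's note: to make the questions above concrete, here is what a per-function target annotation of the sort dag describes might look like, alongside the global-variable question Justin raises. This is invented syntax throughout; nothing here exists in LLVM-IR.]

```llvm
; Invented syntax: LLVM-IR has no per-function target annotation today.
; The sketch only illustrates the open questions.

; Which target "sees" this global? Does it need its own annotation, and
; may functions compiled for two different targets take its address?
@shared = global i32 0                          ; annotated how?

define void @host_entry() target "x86_64" {     ; hypothetical attribute
  ret void
}

define void @kern(i32* %p) target "ptx64" {     ; hypothetical attribute
  ret void
}
```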
dag at cray.com
2012-May-01 15:22 UTC
[LLVMdev] [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation
Justin Holewinski <justin.holewinski at gmail.com> writes:

>> I don't think the code base changes are all that bad. We have a number
>> of them to support generating code one function at a time rather than a
>> whole module together. They've been sitting around waiting for us to
>> send them upstream. It would be an easy matter to simply annotate each
>> function with its target. We don't currently do that because we never
>> write out such IR files, but it seems like a simple problem to solve to
>> me.
>
> If such changes are almost ready to be upstreamed, then great!

Just to clarify, the current changes simply allow a function to be
completely processed (including asm generation) before the next
function is sent to codegen.

> It just seems like a fairly non-trivial task to actually implement
> function-level target selection, especially when you consider function
> call semantics, taking the address of a function, etc.

For something like PTX, runtime calls take care of the call semantics,
so it is either up to the user or the frontend to set up the runtime
calls correctly. We don't need to completely solve this problem. Yet. :)

> If you have a global variable, what target "sees" it? Does it need to
> be annotated along with the function?

For a tool like llc, wouldn't it simply be a matter of changing
TheTarget and reconstituting the various passes? The changes we have
waiting to upstream already allow us to reconstitute passes. I
sometimes use this to turn debugging on/off on a function-level basis.
The way we've constructed our backend interface should allow us to just
switch the target and reinitialize everything. I'm sure I'm glossing
over tons of details, but I don't see a fundamental architectural
problem in LLVM that would prevent this.

> Can functions from two different targets share this pointer?

Again, in the case of PTX it's the runtime's responsibility to ensure
this. I agree passing pointers around complicates things in the general
case, but I also think it's a solvable problem.

> For Yabin's use-case, the X86 portions need to be compiled to
> assembly, or even an object file, while the PTX portions need to be
> lowered to an assembly string and embedded in the X86 source (or
> written to disk somewhere).

I think it's just a matter of switching to a different AsmWriter. The
PTX runtime can load objects from files. The code doesn't have to be a
string in the x86 object file.

> If you're targeting Cell, in contrast, you'd want to compile both down
> to object files.

I think we probably want to do that for PTX as well.

> For me, the bigger question is: do we extend the IR to support
> multiple targets, or do we keep the one-target-per-module philosophy
> and derive some other way of representing how the modules fit
> together? I can see pros and cons for both approaches.

Me too.

> What if instead of per-function annotations, we implement something
> like module file sections? You could organize a module file into
> logical sections based on target architecture. I'm just throwing that
> out there.

Do we allow more than one Module per file? If not, that seems like an
arbitrary limitation. If we allowed that, we could have each module
specify a different target.

                                -Dave
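[Editor's note: dag's closing question can be illustrated with invented syntax. If an LLVM-IR file could carry more than one module, each module could keep its own target triple, preserving the one-target-per-module philosophy while still shipping host and device code together. The `module { ... }` grouping below does not exist; it is purely a sketch of the idea.]

```llvm
; Invented syntax: today an LLVM-IR file holds exactly one module.
module host {
  target triple = "x86_64-unknown-linux-gnu"
  define i32 @main() {
    ret i32 0
  }
}

module device {
  target triple = "ptx64--"
  define void @kern(float* %p) {
    ret void
  }
}
```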
Tobias Grosser
2012-May-07 08:47 UTC
[LLVMdev] [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation
On 04/30/2012 09:55 PM, dag at cray.com wrote:

> Tobias Grosser <tobias at grosser.es> writes:
>
>> To write optimizations that yield embedded GPU code, we also looked into
>> three other approaches:
>>
>> 1. Directly create embedded target code (e.g. PTX)
>>
>> This would mean the optimization pass extracts device code internally
>> and directly generates the relevant target code. This approach would
>> require our generic optimization pass to be directly linked with the
>> specific target back end. This is an ugly layering violation and, in
>> addition, it causes major trouble in case the new optimization should
>> be dynamically loaded.
>
> IMHO it's a bit unrealistic to have a target-independent optimization
> layer. Almost all optimization wants to know target details at some
> point. I think we can and probably should support that. We can allow
> passes to gracefully fall back in the cases where target information is
> not available.

Yes, I agree it makes sense to make target information available to the
optimizers. As you noted yourself, this is different from performing
target code generation in the optimizers.

>> 2. Extend the LLVM-IR files to support heterogeneous modules
>>
>> This would mean we extend LLVM-IR such that IR for different targets
>> can be stored within a single IR file. This approach could be
>> integrated nicely into the LLVM code generation flow and would yield
>> readable LLVM-IR even for the device code. However, it adds another
>> level of complexity to the LLVM-IR files and requires massive changes
>> not only in the LLVM code base, but also in compilers built on top of
>> LLVM-IR.
>
> I don't think the code base changes are all that bad. We have a number
> of them to support generating code one function at a time rather than a
> whole module together. They've been sitting around waiting for us to
> send them upstream. It would be an easy matter to simply annotate each
> function with its target. We don't currently do that because we never
> write out such IR files, but it seems like a simple problem to solve to
> me.

Supporting several modules in one LLVM-IR file may not be too
difficult, but getting this in may still be controversial. The large
amount of changes that I see are changes to the tools. At the moment
all tools expect a single module coming from an LLVM-IR file. I pointed
out the problems in llc and the codegen examples in my other mail.

>> 3. Generate two independent LLVM-IR files and pass them around together
>>
>> The host and device LLVM-IR modules could be kept in separate files.
>> This has the benefit of being user readable and not adding additional
>> complexity to the LLVM-IR files themselves. However, separate files do
>> not provide information about how those files are related. Which files
>> are kernel files, how/where do they need to be loaded, ...? This
>> information could probably be put into metadata or could be hard coded
>> into the generic compiler infrastructure, but this would require
>> significant additional code.
>
> I don't think metadata would work because it would not satisfy the "no
> semantic effects" requirement. We couldn't just drop the metadata and
> expect things to work.

You are right, this solution requires semantic metadata, which is a
non-trivial prerequisite.

>> Another weakness of this approach is that the entire LLVM optimization
>> chain is currently built under the assumption that a single file/module
>> is passed around. This is most obvious with the 'opt | llc' idiom, but
>> in general every tool that currently exists would need to be adapted to
>> handle multiple files and would possibly even need semantic knowledge
>> about how to connect/use them together. Just running clang or dragonegg
>> with -load GPGPUOptimizer.so would not be possible.
>
> Again, we have many of the changes to make this possible. I hope to
> send them for review as we upgrade to 3.1.

Could you provide a list of the changes you have in the pipeline and a
reliable timeline for when you will upstream them? How much additional
work from other people is required to make this a valuable replacement
for the llvm.codegen intrinsic?

>> All of the previous approaches require significant changes all over the
>> code base and would cause trouble with loadable optimization passes.
>> The intrinsic-based approach seems to address most of the previous
>> problems.
>
> I'm pretty uncomfortable with the proposed intrinsic. It feels
> tacked-on and not in the LLVM spirit. We should be able to extend the
> IR to support multiple targets. We're going to need this kind of
> support for much more than GPUs in the future. Heterogeneous computing
> is here to stay.

Where exactly do you see problems with this intrinsic? It is not meant
to block further work in heterogeneous computing, but to allow us to
gradually improve LLVM to gain such features. It especially provides a
low-overhead solution that adds working heterogeneous compute
capabilities for major GPU targets to LLVM. This working solution can
prepare the ground for more closely integrated solutions.

Tobi
dag at cray.com
2012-May-07 16:24 UTC
[LLVMdev] [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation
Tobias Grosser <tobias at grosser.es> writes:

> Supporting several modules in one LLVM-IR file may not be too
> difficult, but getting this in may still be controversial. The large
> amount of changes that I see are changes to the tools. At the moment
> all tools expect a single module coming from an LLVM-IR file. I pointed
> out the problems in llc and the codegen examples in my other mail.

I replied to that mail, so I won't repeat it all here. I don't think
there's any problem given current technology. Since I don't know any
details (only speculation) about what's coming in the future, I can't
comment beyond that.

>> Again, we have many of the changes to make this possible. I hope to
>> send them for review as we upgrade to 3.1.
>
> Could you provide a list of the changes you have in the pipeline and a
> reliable timeline for when you will upstream them? How much additional
> work from other people is required to make this a valuable replacement
> for the llvm.codegen intrinsic?

I'll try to recall the major bits. I did this work 3-4 years ago...

I think the major issue was with the AsmPrinter. There's global state
kept around that needs to be cleared between invocations. The
initialization step needs to be re-run for each function, but there are
some tricky bits that should not happen on each run. That is, most of
AsmPrinter is idempotent, but not all.

Label names are a big issue. A simple label counter (L0, L1, etc.) is
no longer sufficient because the counter gets reset between invocations
and you end up with multiple labels with the same name in the .s file.
We got around this by including the (mangled) function name in the
label name. I had to tweak the mangling code a bit so that it would
generate valid label names. I also consolidated it, as there were at
least two different implementations in the ~2.5 codebase. I don't know
if that's changed.

We don't use much of opt at all. I'm sure there are some issues with
the interprocedural optimizations. We didn't deal with those. All of
our changes are in the llc/codegen piece.

As for getting it upstream, we're moving to 3.1 as soon as it's ready,
and my intention is to push as much of our customized code upstream as
possible during that transition. The above work would be a pretty high
priority, as it is a major source of conflicts for us and I'd rather
just get rid of those. :) So expect to start seeing something within
1-2 months. Unfortunately, we have bureaucratic processes I have to go
through here to get stuff approved for public release.

> Where exactly do you see problems with this intrinsic? It is not meant
> to block further work in heterogeneous computing, but to allow us to
> gradually improve LLVM to gain such features. It especially provides a
> low-overhead solution that adds working heterogeneous compute
> capabilities for major GPU targets to LLVM. This working solution can
> prepare the ground for more closely integrated solutions.

It feels like a code generator bolted onto the side of opt, llc, etc.,
with all of the details that involves. It seems much easier to me to
just go through the "real" code generator.

                                -Dave
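[Editor's note: the label-collision problem dag describes can be illustrated with the sort of .s output involved; the label scheme below is illustrative, not the exact one used.]

```asm
# With one codegen invocation per module, a global counter keeps labels
# unique within the .s file: foo gets .L0/.L1, bar continues at .L2.
# Re-running codegen per function resets that counter, so foo and bar
# would each emit their own ".L0" into the same file. Embedding the
# mangled function name restores uniqueness:
.L_foo_0:            # was .L0 inside foo
        nop
.L_bar_0:            # was .L0 inside bar; no longer clashes with foo's
        nop
```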