Johannes Doerfert via llvm-dev
2020-Jul-28 06:00 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
TL;DR
-----

Let's allow merging two LLVM-IR modules for different targets (with compatible data layouts) into a single LLVM-IR module to facilitate host-device code optimizations.

Wait, what?
-----------

Given an offloading programming model of your choice (CUDA, HIP, SYCL, OpenMP, OpenACC, ...), the current pipeline will most likely optimize the host and the device code in isolation. This is problematic as it makes everything from simple constant propagation to kernel splitting/fusion painfully hard. The proposal is to merge host and device code into a single module during the optimization steps. This should not induce any cost for people who don't use the functionality.

But how do heterogeneous modules help?
--------------------------------------

Assuming we have heterogeneous LLVM-IR modules, we can look at accelerator code optimization as an interprocedural optimization problem. You basically call the "kernel", but you cannot inline it. So you know the call site(s) and arguments, can propagate information back and forth (constants, attributes, ...), and modify the call site as well as the kernel simultaneously, e.g., to split the kernel or fuse consecutive kernels. Without heterogeneous LLVM-IR modules we can do all of this, but it requires a lot more machinery. Given abstract call sites [0,1] and enabled interprocedural optimizations [2], host-device optimizations inside a heterogeneous module are really not (much) different from any other interprocedural optimization.

[0] https://llvm.org/docs/LangRef.html#callback-metadata
[1] https://youtu.be/zfiHaPaoQPc
[2] https://youtu.be/CzWkc_JcfS0

Where are the details?
----------------------

This is merely a proposal to get feedback. I talked to people before and got mixed results. I think this can be done in an "opt-in" way that is non-disruptive and without penalty. I sketched some ideas in [3], but *THIS IS NOT A PROPER PATCH*. If there is interest, I will provide more thoughts on design choices and potential problems. Since there is not much yet, I was hoping this would be a community effort from the very beginning :)

[3] https://reviews.llvm.org/D84728

But MLIR, ...
-------------

I imagine MLIR can be used for this, and there are probably good reasons to do so. Still, we might not want to *only* do it there, for largely the same reasons other things are still developed at the LLVM-IR level. Feel free to ask though :)

Thanks,
  Johannes
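[Editorial note: to make the abstract-call-site mechanism in [0] concrete, here is a small illustration of the existing `!callback` metadata. The broker function `@launch_kernel` and all names are made up; only the metadata encoding follows the LangRef.]

```llvm
; Hypothetical host-side launch broker: its first argument is the callee
; (the kernel), its second argument is the payload forwarded to it.
declare !callback !0 void @launch_kernel(void (i8*)*, i8*)

define void @host() {
  ; Given the callback encoding below, interprocedural passes can treat
  ; this call as an abstract call site of @kernel with null as argument,
  ; and propagate information across it in both directions.
  call void @launch_kernel(void (i8*)* @kernel, i8* null)
  ret void
}

define void @kernel(i8* %payload) {
  ret void
}

!0 = !{!1}
; Encoding: the callee is broker argument 0, its parameter comes from
; broker argument 1, and no variadic arguments are forwarded.
!1 = !{i64 0, i64 1, i1 false}
```

In a heterogeneous module, `@host` and `@kernel` would live side by side even though they target different back ends, so the abstract call site is all an IPO needs to see them as caller and callee.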
Mehdi AMINI via llvm-dev
2020-Jul-28 18:03 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
Hi,

Heterogeneous modules seem like an important feature when targeting accelerators.

On Mon, Jul 27, 2020 at 11:01 PM Johannes Doerfert via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> TL;DR
> -----
>
> Let's allow merging two LLVM-IR modules for different targets (with
> compatible data layouts) into a single LLVM-IR module to facilitate
> host-device code optimizations.

I think the main question I have is with respect to this limitation on the datalayout: isn't it too limiting in practice? I understand that this is much easier to implement in LLVM today, but it may get us into a fairly limited place in terms of what can be supported in the future. Have you looked into what it would take to have heterogeneous modules whose parts each have their own DL?

> [...]
>
> But MLIR, ...
> -------------
>
> I imagine MLIR can be used for this, and there are probably good reasons
> to do so. [...] Feel free to ask though :)

(+1 : MLIR is not intended to be a reason to not improve LLVM!)

-- 
Mehdi
Johannes Doerfert via llvm-dev
2020-Jul-28 19:05 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
[I removed all but the data layout question, that is an important topic]

On 7/28/20 1:03 PM, Mehdi AMINI wrote:
> I think the main question I have is with respect to this limitation on
> the datalayout: isn't it too limiting in practice?
> I understand that this is much easier to implement in LLVM today, but
> it may get us into a fairly limited place in terms of what can be
> supported in the future.
> Have you looked into what it would take to have heterogeneous modules
> whose parts each have their own DL?

Let me share some thoughts on the data layout situation, not all of which are fully matured, but I guess we have to start somewhere:

If we look at the host-device interface, there has to be some agreement on parts of the data layout, namely on what the data that the host sends over and expects back looks like. If I'm not mistaken, GPUs will match the host in things like padding, endianness, etc. because you cannot translate things "on the fly". That said, there might be additional "address spaces" on either side that the other one is not matching/aware of. Long story short, I think host & device need to, and in practice do, agree on the data layout of the address space they use to communicate.

The above is for me a strong hint that we could use address spaces to identify/distinguish differences when we link the modules. However, that might not be sufficient, e.g., if the default alloca address space differs. In that case I don't see a reason not to pull the same "trick" as with the triple: we can specify additional data layouts, one per device, and when you retrieve the data layout, or triple, you pass a global symbol as an "anchor".

For all intraprocedural passes this should be sufficient, as they are only interested in the DL and triple of the function they look at. For IPOs we have to distinguish the ones that know about the host-device calls from the ones that don't. We might have to teach all of them about these calls, but as long as they are callbacks through a driver routine I don't even think we need to.

I'm curious if you or others see an immediate problem with both a device-specific DL and triple (optionally) associated with every global symbol.

~ Johannes
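[Editorial note: to visualize the per-symbol "anchor" idea from the post above, a heterogeneous module might look roughly like the sketch below. This is entirely hypothetical syntax — named datalayout/triple pairs and the `"target-id"` attribute are invented for illustration and are not accepted by LLVM.]

```llvm
; Default (host) layout and triple, exactly as today.
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Hypothetical additional, named layout/triple pair for one device.
target datalayout "nvptx" = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple "nvptx" = "nvptx64-nvidia-cuda"

; Host code resolves to the default pair; a device global "anchors"
; itself to the named pair, so Module::getDataLayout-style queries
; would take the global symbol as an argument.
define void @kernel(i8* %payload) "target-id"="nvptx" {
  ret void
}
```

Intraprocedural passes would only ever query the pair anchored by the function they are working on, which is why they need no other changes under this scheme.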
David Chisnall via llvm-dev
2020-Jul-30 11:01 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
On 28/07/2020 07:00, Johannes Doerfert via llvm-dev wrote:
> TL;DR
> -----
>
> Let's allow merging two LLVM-IR modules for different targets (with
> compatible data layouts) into a single LLVM-IR module to facilitate
> host-device code optimizations.

I think it's worth taking a step back here and thinking through the problem. The proposed solution makes me nervous because it is quite a significant change to the compiler flow, and it comes from thinking of heterogeneous optimisation as a fat LTO problem, when to me it feels more like a thin LTO problem.

At the moment, there's an implicit assumption that everything in a Module will flow to the same CodeGen back end. It can make global assumptions about cost models, can inline everything, and so on.

It sounds as if we have a couple of use cases:

 - Analysis flow between modules
 - Transforms that modify two modules

The first case covers the motivating example of constant propagation. Here it feels like the right approach is something like ThinLTO, where you can collect in one module the fact that a kernel is invoked only with specific constant arguments in the host module and consume that result in the target module.

The second case is what you'd need for things like kernel fusion, where you need to both combine two kernels in the target module and also modify the callers to invoke the single kernel and skip some data flow. For this, you need a kind of pass that can work over things that begin in two modules.

It seems that a less invasive change would be:

 - Use ThinLTO metadata for the first case, extending it as required.
 - Add a new kind of ModuleSetPass that takes a set of Modules and is
   allowed to modify all of them.

This avoids any modifications for the common (single-target) case, but should give you the required functionality. Am I missing something?

David
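[Editorial note: the ThinLTO-style alternative above can be pictured with two separate modules. The sketch below is illustrative — `@launch`, the function names, and the summary wording are made up; the point is only that a cross-module summary, not a merged module, carries the constant.]

```llvm
;; host.ll: the only launch of @kernel passes the constant 128.
declare void @launch(void (i32)*, i32)
declare void @kernel(i32)

define void @host() {
  call void @launch(void (i32)* @kernel, i32 128)
  ret void
}

;; device.ll: a ThinLTO-style summary of host.ll could record
;; "argument 0 of @kernel is always i32 128", letting the device-side
;; compilation fold %n without the two modules ever being merged.
define void @kernel(i32 %n) {
  ; ... body specialized under the assumption %n == 128 ...
  ret void
}
```

This handles one-directional analysis flow well; the open question raised later in the thread is what happens when information must flow in both directions, or when a transform must rewrite both files at once.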
Johannes Doerfert via llvm-dev
2020-Jul-30 12:57 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
[off topic] I'm not a fan of the "reply-to-list" default.

Thanks for the feedback! More below.

On 7/30/20 6:01 AM, David Chisnall via llvm-dev wrote:
> I think it's worth taking a step back here and thinking through the
> problem. The proposed solution makes me nervous because it is quite a
> significant change to the compiler flow, and it comes from thinking of
> heterogeneous optimisation as a fat LTO problem, when to me it feels
> more like a thin LTO problem.
>
> At the moment, there's an implicit assumption that everything in a
> Module will flow to the same CodeGen back end. It can make global
> assumptions about cost models, can inline everything, and so on.

FWIW, I would expect that we split the module *before* the codegen stage, such that the back end doesn't have to deal with heterogeneous modules (right now). I'm not sure about cost models and such, though. As far as I know, we don't make global decisions anywhere, but I might be wrong. Put differently, I hope we don't make global decisions, as it seems quite easy to disturb the result with unrelated code changes.

> It sounds as if we have a couple of use cases:
>
> - Analysis flow between modules
> - Transforms that modify two modules

Yes! Notably, the first bullet is bi-directional and cyclic ;)

> The first case covers the motivating example of constant propagation.
> Here it feels like the right approach is something like ThinLTO, where
> you can collect in one module the fact that a kernel is invoked only
> with specific constant arguments in the host module and consume that
> result in the target module.

Except that you can have cyclic dependencies, which makes this problematic again. You might not propagate constants from the device module to the host one, but knowing that memory is only read/written on the device is very interesting on the host side: you can avoid memory copies, remove globals, etc. That is just what comes to mind right away. The proposed heterogeneous modules should not limit you to "monolithic LTO", or "thin LTO" for that matter.

> The second case is what you'd need for things like kernel fusion,
> where you need to both combine two kernels in the target module and
> also modify the callers to invoke the single kernel and skip some data
> flow. For this, you need a kind of pass that can work over things that
> begin in two modules.

Right. Splitting, fusing, moving code, etc. all require you to modify both modules at the same time. Even if you only modify one module, you want information from both, in either direction.

> It seems that a less invasive change would be:
>
> - Use ThinLTO metadata for the first case, extending it as required.
> - Add a new kind of ModuleSetPass that takes a set of Modules and is
>   allowed to modify all of them.
>
> This avoids any modifications for the common (single-target) case, but
> should give you the required functionality. Am I missing something?

This is similar to what Renato suggested early on. In addition to the "ThinLTO metadata" inefficiencies outlined above, the problem I have with the second part is that it requires writing completely new passes in a style different from anything we have. It is certainly a possibility, but we can probably do it without any changes to the infrastructure.

Beyond the analysis/optimization infrastructure reasons, I would like to point out that this would make our toolchains a lot simpler. We have some embedding of device code in host code right now (on every level), and things like LTO for all offloading models would become much easier if we distributed the heterogeneous modules instead of yet another embedding. I might be biased by the way the "clang offload bundler" is used right now for OpenMP, HIP, etc., but I would very much like to replace that with a "clean" toolchain that performs as much LTO as possible, at least for the accelerator code.

I hope this makes some sense, feel free to ask questions :)

~ Johannes

> David
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev