Johannes Doerfert via llvm-dev
2020-Jul-28 20:50 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
On 7/28/20 3:03 PM, Renato Golin wrote:
> On Tue, 28 Jul 2020 at 20:44, Johannes Doerfert
> <johannesdoerfert at gmail.com> wrote:
>> What I (tried to) describe is that you can pass an array of structs via
>> a CUDA memcpy (or similar) to the device and then expect it to be
>> accessible as an array of structs on the other side. I can imagine this
>> property doesn't hold for *every* programming model, but the question is
>> whether we need to support the ones that don't have it. FWIW, I don't
>> know if it is worth building up a system that allows this property to be
>> missing, or if it is better to not let such systems opt in to the
>> heterogeneous module merging. I guess we would need to list the
>> programming models for which you cannot reasonably expect the above to
>> work.
>
> Right, this is the can of worms I think we won't see before it hits
> us. My main concern is that allowing for opaque interfaces to be
> defined means we'll be able to do almost anything around such simple
> constraints, and the code won't be heavily tested around it (because
> it's really hard to test those constraints).
>
> For example, one constraint is: functions that cross the DL barrier
> (i.e. call functions in another DL) must marshal the arguments in a way
> that the size in bytes is exactly what the function expects, given its
> DL.
>
> This is somewhat easy to verify, but it's not enough to guarantee that
> the alignment of internal elements, structure layout, padding, etc.
> make sense in the target. Unless we write code that packs/unpacks, we
> cannot guarantee it is what we expect. And writing unpack code on a GPU
> may not even be meaningful. And it can change from one GPU family to
> another, or one API to another.
>
> Makes sense?

Kind of. I get the theoretical concern, but I am questioning whether we
need to support that at all. What I'm trying to say is that, for the
programming models I am aware of, this is not a concern to begin with:
the accelerator actually matches the host data layout.

Let's take OpenMP. The compiler cannot know what your memory actually is
because types are, you know, just hints for the most part. So we need
the devices to match the host data layout wrt. padding, alignment, etc.,
or we could not copy an array of structs from one to the other and
expect it to work. CUDA, HIP, SYCL, ... should all be the same. I hope
someone corrects me if I have some misconceptions here :)

>> I think that a multi-DL + multi-triple design seems like a good
>> candidate.
>
> I agree. Multiple-DL is something that comes and goes in the community,
> and so far the "consensus" has been that data layout is hard enough as
> it is. I've always been keen on having it, but not keen on making it
> happen (and fixing all the bugs that will come with it). :D
>
> Another problem we haven't even considered is where the triple will
> come from and in which form. As you know, triples don't usually mean
> anything without further context, and that context isn't present in
> the triple or the DL. They're lowered from the front-end in snippets
> of code (pack/unpack, shift/mask, pad/store/pass pointer) or thunks
> (EH, default class methods).
>
> Once it's lowered, fine, the DL should be mostly fine because
> everything will be lowered anyway. But how will the user identify code
> from multiple different front-ends in the same IR module? If we
> restrict ourselves to a single front-end, then we'll need one front-end
> to rule them all, and that would be counterproductive (and fairly
> limited scope for such a large change).
>
> I fear the infrastructure issues around getting the code inside the
> module will be more complicated (potentially intractable) than once we
> have a multi-DL module to deal with...
>
>> I am in doubt about the "simpler" part but it's an option.
>
> That's an understatement. :)
>
> But I think it's important to understand why, if only to make
> multiple-DL modules more appealing.

Fair. And I'm open to being convinced this is the right approach after
all.

>> The one disadvantage I see is that we have to change the way passes
>> work in this setting versus the single module setting.
>
> Passes will already have to change, as they can't look at every
> function or every call if they target a different DL. Probably a
> simpler change, though.

Again, I'm not so sure. As long as the interface is opaque, e.g., calls
from the host go through a driver API, I doubt there is really a
problem. I imagine it somewhat like OpenMP; here is my conceptual model:

---
// Host
static char KERNEL_ID;

void offload(void *payload) {
  driver_invoke_kernel(&KERNEL_ID, payload);
}

__attribute__((callback (cb_fn, arg)))
void driver_invoke_kernel(void (*cb_fn)(void *), void *arg);


// Device
device 1 global alias KERNEL_ID = kernel;   // "LLVM-IR-like syntax"

device 1 void kernel(void *payload) {
  auto *typed_payload = static_cast<...>(payload);
  ...
}
---

The "important" part is that there is no direct call edge between the
two modules. If we want to support that, we can have that discussion for
sure; I initially assumed we might not need/want to allow it. If not, I
don't see how any interpretation of the IR could lead to a problem.
Every pass, including IPO, can go off and optimize: as long as passes
are not callback- and device-ID-aware, they will not make the connection
from `offload` to `kernel`. Again, there might be models with a direct
call edge that I don't know about, or we might want to prepare for them.
In those cases we need to care, for sure.

~ Johannes
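To make the layout expectation in the mail above concrete, here is a
minimal C++ sketch in the same spirit as the conceptual model. The
`driver_alloc_on_device` and `driver_copy_to_device` entry points are
hypothetical placeholders, just like `driver_invoke_kernel` above, and
do not refer to any particular runtime API; the point is only that the
bytes are copied verbatim, so both sides must agree on the struct's
size, alignment, and padding.

---
#include <cstddef>
#include <cstdint>

// Compiled by both the host and the device compiler; the scheme only
// works if both agree on sizeof/alignof/padding of `Particle`.
struct Particle {
  double Position[3];
  float Mass;
  std::int32_t Id;
};

// Hypothetical driver entry points, analogous to `driver_invoke_kernel`
// in the conceptual model; declarations only, no runtime is implied.
void *driver_alloc_on_device(std::size_t Bytes);
void driver_copy_to_device(void *DevPtr, const void *HostPtr,
                           std::size_t Bytes);

// Host side: ship an array of structs to the device as raw bytes.
void *offload_particles(const Particle *Host, std::size_t N) {
  void *Dev = driver_alloc_on_device(N * sizeof(Particle));
  driver_copy_to_device(Dev, Host, N * sizeof(Particle));
  return Dev;
}

// Device side (conceptually a separate module/triple): the payload is
// reinterpreted as the same struct type. If the device data layout
// padded or aligned `Particle` differently, `P[I].Mass` would read the
// wrong bytes.
void device_kernel(void *Payload, std::size_t N) {
  auto *P = static_cast<Particle *>(Payload);
  for (std::size_t I = 0; I != N; ++I)
    P[I].Mass *= 2.0f;
}
---

As in the conceptual model, there is no direct call edge here: the host
only ever sees an opaque pointer handed to the driver.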
Renato Golin via llvm-dev
2020-Jul-28 23:13 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
On Tue, 28 Jul 2020 at 21:52, Johannes Doerfert
<johannesdoerfert at gmail.com> wrote:
> Let's take OpenMP. The compiler cannot know what your memory actually
> is because types are, you know, just hints for the most part. So we
> need the devices to match the host data layout wrt. padding, alignment,
> etc., or we could not copy an array of structs from one to the other
> and expect it to work. CUDA, HIP, SYCL, ... should all be the same. I
> hope someone corrects me if I have some misconceptions here :)

All those programming models have already been made to inter-work with
CPUs like that. So, if we take the conscious decision that accelerators'
drivers must implement that transparent layer in order to benefit from
LLVM IR's multi-DL, fine.

I have no stakes in any particular accelerator, but we should make it
clear that they must implement that level of transparency to use this
feature of LLVM IR.

> The "important" part is that there is no direct call edge between the
> two modules.

Right! This makes it a lot simpler. We just need to annotate each global
symbol with the right DL and trust that the lowering was done properly.

What about optimisation passes? GPU code skips most of the CPU pipeline
so as not to break codegen later on, but AFAIK this is done by
registering a new pass manager. We'd need to teach passes (or the pass
manager) not to throw accelerator code into the CPU pipeline and
vice versa.

--renato
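As an illustration of the "annotate each global symbol with the right
DL" idea, the sketch below is a strawman and not an existing LLVM
feature: it tags every definition in a (pre-merge) device module with
that module's data layout string under a made-up metadata kind,
`heterogeneous.dl`, so a later consumer of the merged module could still
tell host and device symbols apart. Only the LLVM C++ API calls
themselves are real; the encoding is purely hypothetical.

---
#include "llvm/IR/GlobalObject.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Strawman: before linking a device module into the host module,
// remember the device data layout on every definition so the merged
// module keeps the information around. "heterogeneous.dl" is a made-up
// metadata kind used only for this sketch.
void tagGlobalsWithDL(Module &M) {
  LLVMContext &Ctx = M.getContext();
  MDNode *DLNode =
      MDNode::get(Ctx, MDString::get(Ctx, M.getDataLayoutStr()));
  for (GlobalObject &GO : M.global_objects()) {
    if (GO.isDeclaration())
      continue;
    GO.setMetadata("heterogeneous.dl", DLNode);
  }
}
---

A real multi-DL design would presumably make the data layout a
first-class property of the global rather than metadata, but this kind
of tagging shows that the "trust the lowering" requirement is at least
checkable by a simple module pass.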
Johannes Doerfert via llvm-dev
2020-Jul-29 03:26 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
On 7/28/20 6:13 PM, Renato Golin wrote:
> On Tue, 28 Jul 2020 at 21:52, Johannes Doerfert
> <johannesdoerfert at gmail.com> wrote:
>> Let's take OpenMP. The compiler cannot know what your memory actually
>> is because types are, you know, just hints for the most part. So we
>> need the devices to match the host data layout wrt. padding,
>> alignment, etc., or we could not copy an array of structs from one to
>> the other and expect it to work. CUDA, HIP, SYCL, ... should all be
>> the same. I hope someone corrects me if I have some misconceptions
>> here :)
>
> All those programming models have already been made to inter-work with
> CPUs like that. So, if we take the conscious decision that
> accelerators' drivers must implement that transparent layer in order
> to benefit from LLVM IR's multi-DL, fine.
>
> I have no stakes in any particular accelerator, but we should make it
> clear that they must implement that level of transparency to use this
> feature of LLVM IR.

Yes. Whatever we do, it should be clear what requirements there are for
you to create a multi-target module. We can probably even verify some of
them, like the direct-call-edge thing.

>> The "important" part is that there is no direct call edge between the
>> two modules.
>
> Right! This makes it a lot simpler. We just need to annotate each
> global symbol with the right DL and trust that the lowering was done
> properly.
>
> What about optimisation passes? GPU code skips most of the CPU
> pipeline so as not to break codegen later on, but AFAIK this is done
> by registering a new pass manager.

That is an interesting point. We could arguably teach the (new) PM to
run different pipelines for the different devices. FWIW, I'm not even
sure we do that right now, e.g., for CUDA compilation. [long live
uniformity!]

> We'd need to teach passes (or the pass manager) not to throw
> accelerator code into the CPU pipeline and vice versa.

What do you mean by accelerator code? Intrinsics, vector length, etc.
should be controlled by the triple, so that should be handled.

~ Johannes

> --renato
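To sketch the "different pipelines for the different devices" idea in
code: the following is a hypothetical example against the new pass
manager's C++ API (a recent LLVM is assumed). It presumes a per-function
"target-triple" string attribute, which does not exist today and is
exactly the kind of annotation a multi-DL/multi-triple module would have
to introduce; the PassBuilder calls themselves are real.

---
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Passes/PassBuilder.h"

using namespace llvm;

// Hypothetical driver: run separately built simplification pipelines
// over host and device functions so device code never sees the host
// pipeline.
void runSplitPipelines(Module &M) {
  LoopAnalysisManager LAM;
  FunctionAnalysisManager FAM;
  CGSCCAnalysisManager CGAM;
  ModuleAnalysisManager MAM;
  PassBuilder PB;
  PB.registerModuleAnalyses(MAM);
  PB.registerCGSCCAnalyses(CGAM);
  PB.registerFunctionAnalyses(FAM);
  PB.registerLoopAnalyses(LAM);
  PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

  // Two pipelines; in a real setup they would be configured differently
  // (e.g., different vectorization choices for the device one).
  FunctionPassManager HostFPM = PB.buildFunctionSimplificationPipeline(
      OptimizationLevel::O2, ThinOrFullLTOPhase::None);
  FunctionPassManager DeviceFPM = PB.buildFunctionSimplificationPipeline(
      OptimizationLevel::O2, ThinOrFullLTOPhase::None);

  for (Function &F : M) {
    if (F.isDeclaration())
      continue;
    // Made-up per-function triple annotation; today the triple lives on
    // the module, so this attribute is an assumption of the design.
    bool IsDevice = F.hasFnAttribute("target-triple") &&
                    F.getFnAttribute("target-triple")
                        .getValueAsString()
                        .contains("nvptx");
    FunctionPassManager &FPM = IsDevice ? DeviceFPM : HostFPM;
    FPM.run(F, FAM);
  }
}
---

Whether this dispatch should live inside the pass manager itself or in a
wrapper like the above is exactly the open question from the thread; the
sketch only shows that the split can be made in one place instead of in
every pass.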