Johannes Doerfert via llvm-dev
2020-Jul-28 06:00 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
TL;DR
-----

Let's allow merging two LLVM-IR modules for different targets (with compatible data layouts) into a single LLVM-IR module to facilitate host-device code optimizations.

Wait, what?
-----------

Given an offloading programming model of your choice (CUDA, HIP, SYCL, OpenMP, OpenACC, ...), the current pipeline will most likely optimize the host and the device code in isolation. This is problematic as it makes everything from simple constant propagation to kernel splitting/fusion painfully hard. The proposal is to merge host and device code into a single module during the optimization steps. This should not induce any cost for people who don't use the functionality.

But how do heterogeneous modules help?
--------------------------------------

Assuming we have heterogeneous LLVM-IR modules, we can look at accelerator code optimization as an interprocedural optimization problem. You basically call the "kernel", but you cannot inline it. So you know the call site(s) and arguments, can propagate information back and forth (constants, attributes, ...), and modify the call site as well as the kernel simultaneously, e.g., to split the kernel or fuse consecutive kernels. Without heterogeneous LLVM-IR modules we can do all of this, but it requires a lot more machinery. Given abstract call sites [0,1] and enabled interprocedural optimizations [2], host-device optimizations inside a heterogeneous module are really not (much) different from any other interprocedural optimization.

[0] https://llvm.org/docs/LangRef.html#callback-metadata
[1] https://youtu.be/zfiHaPaoQPc
[2] https://youtu.be/CzWkc_JcfS0

Where are the details?
----------------------

This is merely a proposal to get feedback. I talked to people before and got mixed results. I think this can be done in an "opt-in" way that is non-disruptive and without penalty. I sketched some ideas in [3], but *THIS IS NOT A PROPER PATCH*. If there is interest, I will provide more thoughts on design choices and potential problems. Since there is not much yet, I was hoping this would be a community effort from the very beginning :)

[3] https://reviews.llvm.org/D84728

But MLIR, ...
-------------

I imagine MLIR can be used for this, and there are probably good reasons to do so. Still, we might not want to *only* do it there, for largely the same reasons other things are still developed at the LLVM-IR level. Feel free to ask though :)

Thanks,
  Johannes
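[Editorial note: to make the abstract-call-site mechanism in [0] concrete, here is a small illustration of the existing `!callback` metadata. The broker function `@launch_kernel` and all names are made up; only the metadata encoding follows the LangRef.]

```llvm
; Hypothetical host-side launch broker: its first argument is the callee
; (the kernel), its second argument is the payload forwarded to it.
declare !callback !0 void @launch_kernel(void (i8*)*, i8*)

define void @host() {
  ; Given the callback encoding below, interprocedural passes can treat
  ; this call as an abstract call site of @kernel with null as argument,
  ; and propagate information across it in both directions.
  call void @launch_kernel(void (i8*)* @kernel, i8* null)
  ret void
}

define void @kernel(i8* %payload) {
  ret void
}

!0 = !{!1}
; Encoding: the callee is broker argument 0, its parameter comes from
; broker argument 1, and no variadic arguments are forwarded.
!1 = !{i64 0, i64 1, i1 false}
```

In a heterogeneous module, `@host` and `@kernel` would live side by side even though they target different back ends, so the abstract call site is all an IPO needs to see them as caller and callee.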
Mehdi AMINI via llvm-dev
2020-Jul-28 18:03 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
Hi,

Heterogeneous modules seem like an important feature when targeting accelerators.

On Mon, Jul 27, 2020 at 11:01 PM Johannes Doerfert via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> TL;DR
> -----
>
> Let's allow merging two LLVM-IR modules for different targets (with
> compatible data layouts) into a single LLVM-IR module to facilitate
> host-device code optimizations.

I think the main question I have is with respect to this limitation on the datalayout: isn't it too limiting in practice? I understand that this is much easier to implement in LLVM today, but it may get us into a fairly limited place in terms of what can be supported in the future. Have you looked into what it would take to have heterogeneous modules whose parts each have their own DL?

> [...]
>
> But MLIR, ...
> -------------
>
> I imagine MLIR can be used for this, and there are probably good reasons
> to do so. [...] Feel free to ask though :)

(+1 : MLIR is not intended to be a reason to not improve LLVM!)

-- 
Mehdi
Johannes Doerfert via llvm-dev
2020-Jul-28 19:05 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
[I removed all but the data layout question, that is an important topic]

On 7/28/20 1:03 PM, Mehdi AMINI wrote:
> I think the main question I have is with respect to this limitation on
> the datalayout: isn't it too limiting in practice?
> I understand that this is much easier to implement in LLVM today, but
> it may get us into a fairly limited place in terms of what can be
> supported in the future.
> Have you looked into what it would take to have heterogeneous modules
> whose parts each have their own DL?

Let me share some thoughts on the data layout situation, not all of which are fully matured, but I guess we have to start somewhere:

If we look at the host-device interface, there has to be some agreement on parts of the data layout, namely on what the data that the host sends over and expects back looks like. If I'm not mistaken, GPUs will match the host in things like padding, endianness, etc. because you cannot translate things "on the fly". That said, there might be additional "address spaces" on either side that the other one is not matching/aware of. Long story short, I think host & device need to, and in practice do, agree on the data layout of the address space they use to communicate.

The above is for me a strong hint that we could use address spaces to identify/distinguish differences when we link the modules. However, that might not be sufficient, e.g., if the default alloca address space differs. In that case I don't see a reason not to pull the same "trick" as with the triple: we can specify additional data layouts, one per device, and when you retrieve the data layout, or triple, you pass a global symbol as an "anchor".

For all intraprocedural passes this should be sufficient, as they are only interested in the DL and triple of the function they look at. For IPOs we have to distinguish the ones that know about the host-device calls from the ones that don't. We might have to teach all of them about these calls, but as long as they are callbacks through a driver routine I don't even think we need to.

I'm curious if you or others see an immediate problem with both a device-specific DL and triple (optionally) associated with every global symbol.

~ Johannes
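[Editorial note: to visualize the per-symbol "anchor" idea from the post above, a heterogeneous module might look roughly like the sketch below. This is entirely hypothetical syntax — named datalayout/triple pairs and the `"target-id"` attribute are invented for illustration and are not accepted by LLVM.]

```llvm
; Default (host) layout and triple, exactly as today.
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Hypothetical additional, named layout/triple pair for one device.
target datalayout "nvptx" = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple "nvptx" = "nvptx64-nvidia-cuda"

; Host code resolves to the default pair; a device global "anchors"
; itself to the named pair, so Module::getDataLayout-style queries
; would take the global symbol as an argument.
define void @kernel(i8* %payload) "target-id"="nvptx" {
  ret void
}
```

Intraprocedural passes would only ever query the pair anchored by the function they are working on, which is why they need no other changes under this scheme.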
David Chisnall via llvm-dev
2020-Jul-30 11:01 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
On 28/07/2020 07:00, Johannes Doerfert via llvm-dev wrote:
> TL;DR
> -----
>
> Let's allow merging two LLVM-IR modules for different targets (with
> compatible data layouts) into a single LLVM-IR module to facilitate
> host-device code optimizations.

I think it's worth taking a step back here and thinking through the problem. The proposed solution makes me nervous because it is quite a significant change to the compiler flow, and it comes from thinking of heterogeneous optimisation as a fat LTO problem, when to me it feels more like a thin LTO problem.

At the moment, there's an implicit assumption that everything in a Module will flow to the same CodeGen back end. It can make global assumptions about cost models, can inline everything, and so on.

It sounds as if we have a couple of use cases:

 - Analysis flow between modules
 - Transforms that modify two modules

The first case covers the motivating example of constant propagation. Here it feels like the right approach is something like ThinLTO, where you can collect in one module the fact that a kernel is invoked only with specific constant arguments in the host module and consume that result in the target module.

The second case is what you'd need for things like kernel fusion, where you need to both combine two kernels in the target module and also modify the callers to invoke the single kernel and skip some data flow. For this, you need a kind of pass that can work over things that begin in two modules.

It seems that a less invasive change would be:

 - Use ThinLTO metadata for the first case, extending it as required.
 - Add a new kind of ModuleSetPass that takes a set of Modules and is
   allowed to modify all of them.

This avoids any modifications for the common (single-target) case, but should give you the required functionality. Am I missing something?

David
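[Editorial note: the ThinLTO-style alternative above can be pictured with two separate modules. The sketch below is illustrative — `@launch`, the function names, and the summary wording are made up; the point is only that a cross-module summary, not a merged module, carries the constant.]

```llvm
;; host.ll: the only launch of @kernel passes the constant 128.
declare void @launch(void (i32)*, i32)
declare void @kernel(i32)

define void @host() {
  call void @launch(void (i32)* @kernel, i32 128)
  ret void
}

;; device.ll: a ThinLTO-style summary of host.ll could record
;; "argument 0 of @kernel is always i32 128", letting the device-side
;; compilation fold %n without the two modules ever being merged.
define void @kernel(i32 %n) {
  ; ... body specialized under the assumption %n == 128 ...
  ret void
}
```

This handles one-directional analysis flow well; the open question raised later in the thread is what happens when information must flow in both directions, or when a transform must rewrite both files at once.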
Johannes Doerfert via llvm-dev
2020-Jul-30 12:57 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
[off topic] I'm not a fan of the "reply-to-list" default.

Thanks for the feedback! More below.

On 7/30/20 6:01 AM, David Chisnall via llvm-dev wrote:
> I think it's worth taking a step back here and thinking through the
> problem. The proposed solution makes me nervous because it is quite a
> significant change to the compiler flow, and it comes from thinking of
> heterogeneous optimisation as a fat LTO problem, when to me it feels
> more like a thin LTO problem.
>
> At the moment, there's an implicit assumption that everything in a
> Module will flow to the same CodeGen back end. It can make global
> assumptions about cost models, can inline everything, and so on.

FWIW, I would expect that we split the module *before* the codegen stage, such that the back end doesn't have to deal with heterogeneous modules (right now). I'm not sure about cost models and such, though. As far as I know, we don't make global decisions anywhere, but I might be wrong. Put differently, I hope we don't make global decisions, as it seems quite easy to disturb the result with unrelated code changes.

> It sounds as if we have a couple of use cases:
>
> - Analysis flow between modules
> - Transforms that modify two modules

Yes! Notably, the first bullet is bi-directional and cyclic ;)

> The first case covers the motivating example of constant propagation.
> Here it feels like the right approach is something like ThinLTO, where
> you can collect in one module the fact that a kernel is invoked only
> with specific constant arguments in the host module and consume that
> result in the target module.

Except that you can have cyclic dependencies, which makes this problematic again. You might not propagate constants from the device module to the host one, but knowing that memory is only read/written on the device is very interesting on the host side: you can avoid memory copies, remove globals, etc. That is just what comes to mind right away. The proposed heterogeneous modules should not limit you to "monolithic LTO", or "thin LTO" for that matter.

> The second case is what you'd need for things like kernel fusion,
> where you need to both combine two kernels in the target module and
> also modify the callers to invoke the single kernel and skip some data
> flow. For this, you need a kind of pass that can work over things that
> begin in two modules.

Right. Splitting, fusing, moving code, etc. all require you to modify both modules at the same time. Even if you only modify one module, you want information from both, in either direction.

> It seems that a less invasive change would be:
>
> - Use ThinLTO metadata for the first case, extending it as required.
> - Add a new kind of ModuleSetPass that takes a set of Modules and is
>   allowed to modify all of them.
>
> This avoids any modifications for the common (single-target) case, but
> should give you the required functionality. Am I missing something?

This is similar to what Renato suggested early on. In addition to the "ThinLTO metadata" inefficiencies outlined above, the problem I have with the second part is that it requires writing completely new passes in a style different from anything we have. It is certainly a possibility, but we can probably do it without any changes to the infrastructure.

Beyond the analysis/optimization infrastructure reasons, I would like to point out that this would make our toolchains a lot simpler. We have some embedding of device code in host code right now (on every level), and things like LTO for all offloading models would become much easier if we distributed the heterogeneous modules instead of yet another embedding. I might be biased by the way the "clang offload bundler" is used right now for OpenMP, HIP, etc., but I would very much like to replace that with a "clean" toolchain that performs as much LTO as possible, at least for the accelerator code.

I hope this makes some sense, feel free to ask questions :)

~ Johannes

> David
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev