Johannes Doerfert via llvm-dev
2020-Jul-28 19:05 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
[I removed all but the data layout question, that is an important topic]

On 7/28/20 1:03 PM, Mehdi AMINI wrote:
>> TL;DR
>> -----
>>
>> Let's allow merging two LLVM-IR modules for different targets (with
>> compatible data layouts) into a single LLVM-IR module to facilitate
>> host-device code optimizations.
>
> I think the main question I have is with respect to this limitation on
> the datalayout: isn't it too limiting in practice?
> I understand that this is much easier to implement in LLVM today, but it
> may get us into a fairly limited place in terms of what can be supported
> in the future.
> Have you looked into what it would take to have heterogeneous modules
> that have their own DL?

Let me share some thoughts on the data layout situation, not all of which
are fully matured, but I guess we have to start somewhere:

If we look at the host-device interface, there has to be some agreement on
parts of the data layout, namely what the data the host sends over and
expects back looks like. If I'm not mistaken, GPUs will match the host in
things like padding, endianness, etc., because you cannot translate things
"on the fly". That said, there might be additional "address spaces" on
either side that the other one is not matching/aware of. Long story short,
I think host & device need to, and in practice do, agree on the data layout
of the address space they use to communicate.

The above is for me a strong hint that we could use address spaces to
identify/distinguish differences when we link the modules. However, there
might be cases where this is not sufficient, e.g., if the default alloca
address space differs. In that case I don't see a reason not to pull the
same "trick" as with the triple. We can specify additional data layouts,
one per device, and if you retrieve the data layout, or triple, you need to
pass a global symbol as an "anchor". For all intraprocedural passes this
should be sufficient, as they are only interested in the DL and triple of
the function they look at. For IPOs we have to distinguish the ones that
know about the host-device calls from the ones that don't. We might have to
teach all of them about these calls, but as long as they are callbacks
through a driver routine I don't even think we need to.

I'm curious if you or others see an immediate problem with both a
device-specific DL and triple (optionally) associated with every global
symbol.

~ Johannes
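To make the per-symbol "anchor" idea concrete, here is a minimal sketch,
assuming a hypothetical side table next to the module (PerSymbolDataLayout
and assignDeviceDL are made-up names for illustration; Module::getDataLayout()
and the DataLayout string constructor are existing LLVM API):

  #include "llvm/ADT/DenseMap.h"
  #include "llvm/ADT/StringRef.h"
  #include "llvm/IR/DataLayout.h"
  #include "llvm/IR/GlobalValue.h"
  #include "llvm/IR/Module.h"

  // Hypothetical helper, not part of LLVM: associates device data layouts
  // with individual global symbols; everything else keeps using the module
  // default.
  class PerSymbolDataLayout {
    const llvm::Module &M;
    llvm::DenseMap<const llvm::GlobalValue *, llvm::DataLayout> DeviceDLs;

  public:
    explicit PerSymbolDataLayout(const llvm::Module &M) : M(M) {}

    // Mark a global symbol as belonging to a device with its own DL string.
    void assignDeviceDL(const llvm::GlobalValue *GV, llvm::StringRef DL) {
      DeviceDLs.try_emplace(GV, llvm::DataLayout(DL));
    }

    // The "anchor" query: the device DL if the symbol has one, otherwise
    // the module-wide (host) data layout.
    const llvm::DataLayout &getDataLayout(const llvm::GlobalValue *Anchor) const {
      auto It = DeviceDLs.find(Anchor);
      return It != DeviceDLs.end() ? It->second : M.getDataLayout();
    }
  };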
Renato Golin via llvm-dev
2020-Jul-28 19:24 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
On Tue, 28 Jul 2020 at 20:07, Johannes Doerfert via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Long story short, I think host & device need to, and in practice do,
> agree on the data layout of the address space they use to communicate.

You can design APIs that call functions into external hardware that has a
completely different data layout; you just need to properly pack and unpack
the arguments and results. IIUC, that's what you call "agree on the DL"?

In an LLVM module, with the single-DL requirement, this wouldn't work. But
if we had multiple named DLs, and attributes on functions and globals
tagging them with those DLs, then you could have multiple DLs in the same
module; as long as control flow on one side never reaches the other (only
through specific API calls), it should be "fine". However, this is hardly
well defined and home to unlimited corner cases to handle. Using namespaces
would work for addresses, but other type sizes and alignments would have to
be defined anyway, and then we're back to the multiple-DL-tags scenario.

Given that we're not allowing them to inline or interact, I wonder if a
"simpler" approach would be to allow more than one module per "compile
unit"? Those are some very strong quotes, mind you, but it would "solve"
the DL problem entirely. Since both modules are in memory, perhaps even
passing through different pipelines (CPU, GPU, FPGA), we can do constant
propagation, kernel specialisation and strong DCE by identifying the
contact points, while still treating them as separate modules. In essence,
it would be the same as having them in the same module, but without having
to juggle function attributes and data layout compatibility issues.

The big question is, obviously, how many things would break if we had two
or more modules live at the same time. Global contexts would have to be
rewritten, but if each module goes through its own optimisation pipeline,
then the hardest part would be building the bridge between them (call graph
and other analyses) and keeping it up to date as all modules walk through
their pipelines, so that passes like constant propagation can "see" through
the module barrier.

cheers,
--renato
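As a rough illustration of the multi-module "compile unit" idea, a sketch of
the kind of bridge structure it would need (HeterogeneousCompileUnit and
ContactPoint are made up for this example; Module, Function and LLVMContext
are existing LLVM classes):

  #include "llvm/IR/Function.h"
  #include "llvm/IR/LLVMContext.h"
  #include "llvm/IR/Module.h"
  #include <memory>
  #include <vector>

  // One "compile unit" owning several modules (host, GPU, FPGA, ...) that
  // run their own pipelines, plus an explicit list of contact points that
  // inter-module analyses (call graph, constant propagation) would use to
  // "see" through the module barrier.
  struct ContactPoint {
    llvm::Function *HostSite;    // host-side launch/stub function
    llvm::Function *DeviceEntry; // device kernel it ends up invoking
  };

  struct HeterogeneousCompileUnit {
    llvm::LLVMContext Ctx;                              // shared context
    std::vector<std::unique_ptr<llvm::Module>> Modules; // host + device modules
    std::vector<ContactPoint> Bridge;                   // the analysis bridge
  };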
Mehdi AMINI via llvm-dev
2020-Jul-28 19:25 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
On Tue, Jul 28, 2020 at 12:07 PM Johannes Doerfert
<johannesdoerfert at gmail.com> wrote:
> [I removed all but the data layout question, that is an important topic]
>
> [...]
>
> I'm curious if you or others see an immediate problem with both a
> device-specific DL and triple (optionally) associated with every global
> symbol.

Having a triple/DL per global symbol would likely solve everything; I
didn't get from your original email that this was considered.
If I understand correctly what you're describing, the DL on the Module
would be a "default", and we'd need to make the DL/triple APIs on the
Module "private" to force queries to go through an API on GlobalValue to
get the DL/triple?
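A sketch of the difference for a pass, assuming the hypothetical per-symbol
accessors (the GlobalValue-level getDataLayout/getTargetTriple below do not
exist today; the Module-level query does):

  #include "llvm/IR/DataLayout.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/Module.h"

  // Today: every query goes through the module and gets the single DL.
  const llvm::DataLayout &getDLToday(const llvm::Function &F) {
    return F.getParent()->getDataLayout();
  }

  // Under the proposal (hypothetical): Module::getDataLayout() would be
  // private, and the query would be anchored on the global value itself,
  // returning either its device DL or the module default, e.g.
  //   const DataLayout &DL = F.getDataLayout();  // per-symbol, hypothetical
  //   Triple T = F.getTargetTriple();            // per-symbol, hypothetical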
Johannes Doerfert via llvm-dev
2020-Jul-28 19:28 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
On 7/28/20 2:25 PM, Mehdi AMINI wrote:
> On Tue, Jul 28, 2020 at 12:07 PM Johannes Doerfert
> <johannesdoerfert at gmail.com> wrote:
>>
>> [...]
>>
>> I'm curious if you or others see an immediate problem with both a
>> device-specific DL and triple (optionally) associated with every global
>> symbol.
>
> Having a triple/DL per global symbol would likely solve everything; I
> didn't get from your original email that this was considered.
> If I understand correctly what you're describing, the DL on the Module
> would be a "default", and we'd need to make the DL/triple APIs on the
> Module "private" to force queries to go through an API on GlobalValue to
> get the DL/triple?

That is what I tried to describe, yes. The "patch" I posted does this
"conceptually" for the triple. You make them private or require a global
value to be passed as part of the request; same result, I guess. The key is
that the DL/triple is a property of the global symbol.

I'll respond to Renato's concerns on this as part of a response to him.
Johannes Doerfert via llvm-dev
2020-Jul-28 19:42 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
On 7/28/20 2:24 PM, Renato Golin wrote:
> On Tue, 28 Jul 2020 at 20:07, Johannes Doerfert via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Long story short, I think host & device need to, and in practice do,
>> agree on the data layout of the address space they use to communicate.
>
> You can design APIs that call functions into external hardware that has a
> completely different data layout; you just need to properly pack and
> unpack the arguments and results. IIUC, that's what you call "agree on
> the DL"?

What I (tried to) describe is that you can pass an array of structs via a
CUDA memcpy (or similar) to the device and then expect it to be accessible
as an array of structs on the other side. I can imagine this property
doesn't hold for *every* programming model, but the question is whether we
need to support the ones that don't have it. FWIW, I don't know if it is
worth building up a system that allows this property to be missing, or if
it is better to not let such systems opt in to the heterogeneous module
merging. I guess we would need to list the programming models for which
you cannot reasonably expect the above to work.

> In an LLVM module, with the single-DL requirement, this wouldn't work.
> But if we had multiple named DLs, and attributes on functions and globals
> tagging them with those DLs, then you could have multiple DLs in the same
> module; as long as control flow on one side never reaches the other (only
> through specific API calls), it should be "fine". However, this is hardly
> well defined and home to unlimited corner cases to handle. Using
> namespaces would work for addresses, but other type sizes and alignments
> would have to be defined anyway, and then we're back to the
> multiple-DL-tags scenario.

I think that a multi-DL + multi-triple design seems like a good candidate.
I'm not sure about the corner cases you imagine, but I guess that is the
nature of corner cases. And, to be fair, we haven't really talked about
many details yet. If we think there is a path forward, we could come up
with restrictions and requirements, and hopefully convince ourselves and
others that it could work, or realize why not :)

> Given that we're not allowing them to inline or interact, I wonder if a
> "simpler" approach would be to allow more than one module per "compile
> unit"? Those are some very strong quotes, mind you, but it would "solve"
> the DL problem entirely. Since both modules are in memory, perhaps even
> passing through different pipelines (CPU, GPU, FPGA), we can do constant
> propagation, kernel specialisation and strong DCE by identifying the
> contact points, while still treating them as separate modules. In
> essence, it would be the same as having them in the same module, but
> without having to juggle function attributes and data layout
> compatibility issues.
>
> The big question is, obviously, how many things would break if we had
> two or more modules live at the same time. Global contexts would have to
> be rewritten, but if each module goes through its own optimisation
> pipeline, then the hardest part would be building the bridge between them
> (call graph and other analyses) and keeping it up to date as all modules
> walk through their pipelines, so that passes like constant propagation
> can "see" through the module barrier.

I am in doubt about the "simpler" part, but it's an option. The one
disadvantage I see is that we have to change the way passes work in this
setting versus the single-module setting, or somehow pretend they are in a
single module, at which point the entire separation seems to lose its
appeal.

I still believe that callbacks (+IPO) can make optimization of a
heterogeneous module look like the optimization of a regular module, the
same way callbacks blur the line between IPO and IPO of parallel programs,
e.g., across the "transitive call" performed by pthread_create.

~ Johannes
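For reference, the kind of exchange described above: the host copies an
array of structs to the device bit-for-bit, which only works because both
sides agree on the struct's size, padding, and endianness (plain CUDA
runtime API; the struct and function names are just for the example):

  #include <cuda_runtime.h>
  #include <cstddef>

  // The device sees exactly these bytes, so size, padding, and endianness
  // must mean the same thing on both sides of the copy.
  struct Item {
    int Key;
    float Value;
  };

  void copyItemsToDevice(const Item *HostItems, std::size_t N,
                         Item **DeviceItemsOut) {
    Item *DeviceItems = nullptr;
    // Allocate device memory and copy the array of structs as raw bytes;
    // there is no per-field marshalling step that could re-layout the data.
    cudaMalloc(reinterpret_cast<void **>(&DeviceItems), N * sizeof(Item));
    cudaMemcpy(DeviceItems, HostItems, N * sizeof(Item),
               cudaMemcpyHostToDevice);
    *DeviceItemsOut = DeviceItems;
    // A kernel would then index DeviceItems[i].Key / .Value directly,
    // relying on host and device agreeing on Item's layout.
  }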