Johannes Doerfert via llvm-dev
2020-Jul-28 19:42 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
On 7/28/20 2:24 PM, Renato Golin wrote:
> On Tue, 28 Jul 2020 at 20:07, Johannes Doerfert via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Long story short, I think host & device need to, and in practice do,
>> agree on the data layout of the address space they use to communicate.
>
> You can design APIs that call functions into external hardware that
> have completely different data layout, you just need to properly pack
> and unpack the arguments and results. IIUC, that's what you call
> "agree on the DL"?

What I (tried to) describe is that you can pass an array of structs via
a CUDA memcpy (or similar) to the device and then expect it to be
accessible as an array of structs on the other side. I can imagine this
property doesn't hold for *every* programming model, but the question is
if we need to support the ones that don't have it. FWIW, I don't know if
it is worth building up a system that can allow this property to be
missing or if it is better to not allow such systems to opt in to the
heterogeneous module merging. I guess we would need to list the
programming models for which you cannot reasonably expect the above to
work.

> In an LLVM module, with the single-DL requirement, this wouldn't work.
> But if we had multiple named DLs and attributes to functions and
> globals tagged with those DLs, then you could have multiple DLs on the
> same module, as long as their control flow never reaches the other
> (only through specific API calls), it should be "fine". However, this
> is hardly well defined and home to unlimited corner cases to handle.
> Using namespaces would work for addresses, but other type sizes and
> alignment would have to be defined anyway, then we're back to the
> multiple-DL tags scenario.

I think that a multi-DL + multi-triple design seems like a good
candidate. I'm not sure about the corner cases you imagine, but I guess
that is the nature of corner cases. And, to be fair, we haven't really
talked about the details much yet. If we think there is a path forward
we could come up with restrictions and requirements, and hopefully
convince ourselves and others that it could work, or realize why not :)

> Given that we're not allowing them to inline or interact, I wonder if
> a "simpler" approach would be to allow more than one module per
> "compile unit"? Those are some very strong quotes, mind you, but it
> would "solve" the DL problem entirely. Since both modules are in
> memory, perhaps even passing through different pipelines (CPU, GPU,
> FPGA), we can do constant propagation, kernel specialisation and
> strong DCE by identifying the contact points, but still treating them
> as separate modules. In essence, it would be the same as having them
> on the same module, but without having to juggle function attributes
> and data layout compatibility issues.
>
> The big question is, obviously, how many things would break if we had
> two or more modules live at the same time. Global contexts would have
> to be rewritten, but if each module passes on their own optimisation
> pipelines, then the hardest part would be building the bridge between
> them (call graph and other analysis) and keep that up-to-date as all
> modules walk through their pipelines, so that passes like constant
> propagation can "see" through the module barrier.

I am in doubt about the "simpler" part, but it's an option. The one
disadvantage I see is that we have to change the way passes work in this
setting versus the single module setting. Or we somehow pretend they are
in a single module, at which point the entire separation seems to lose
its appeal.

I still believe that callbacks (+IPO) can make optimization of
heterogeneous modules look like the optimization of regular modules, the
same way callbacks blur the line between IPO and IPO of parallel
programs, e.g., across the "transitive call" performed by
pthread_create.

~ Johannes
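
To make the layout-agreement point concrete, the pattern under discussion
is roughly the following minimal sketch (hypothetical struct and kernel
names; standard CUDA runtime API): the host copies an array of structs to
the device as raw bytes, and the device kernel indexes the same bytes with
the same struct definition, which is only meaningful because both sides
agree on size, alignment, and padding.

---
#include <cuda_runtime.h>

// Shared between host and device; both sides must agree on the size,
// alignment, and padding of this type for the raw copy below to work.
struct Particle {
  float pos[3];
  int   id;
};

// Device kernel indexing the same array-of-structs layout the host wrote.
__global__ void scale_ids(Particle *ps, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    ps[i].id *= 2;
}

void run(Particle *host_ps, int n) {
  Particle *dev_ps = nullptr;
  cudaMalloc((void **)&dev_ps, n * sizeof(Particle));
  // Plain byte copy: no per-field marshalling happens anywhere.
  cudaMemcpy(dev_ps, host_ps, n * sizeof(Particle), cudaMemcpyHostToDevice);
  scale_ids<<<(n + 255) / 256, 256>>>(dev_ps, n);
  cudaMemcpy(host_ps, dev_ps, n * sizeof(Particle), cudaMemcpyDeviceToHost);
  cudaFree(dev_ps);
}
---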
Renato Golin via llvm-dev
2020-Jul-28 20:03 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
On Tue, 28 Jul 2020 at 20:44, Johannes Doerfert
<johannesdoerfert at gmail.com> wrote:
> What I (tried to) describe is that you can pass an array of structs via
> a CUDA memcpy (or similar) to the device and then expect it to be
> accessible as an array of structs on the other side. I can imagine this
> property doesn't hold for *every* programming model, but the question is
> if we need to support the ones that don't have it. FWIW, I don't know if
> it is worth building up a system that can allow this property to be
> missing or if it is better to not allow such systems to opt in to the
> heterogeneous module merging. I guess we would need to list the
> programming models for which you cannot reasonably expect the above to
> work.

Right, this is the can of worms I think we won't see before it hits us.
My main concern is that allowing for opaque interfaces to be defined
means we'll be able to do almost anything around such simple constraints,
and the code won't be heavily tested around it (because it's really hard
to test those constraints).

For example, one constraint is: functions that cross the DL barrier
(i.e. call functions in another DL) must marshall the arguments in a way
that the size in bytes is exactly what the function expects, given its
DL.

This is somewhat easy to verify, but it's not enough to guarantee that
the alignment of internal elements, structure layout, padding, etc. make
sense on the target. Unless we write code that packs/unpacks, we cannot
guarantee it is what we expect. And writing unpack code on a GPU may not
even be meaningful. And it can change from one GPU family to another, or
one API to another.

Makes sense?

> I think that a multi-DL + multi-triple design seems like a good
> candidate.

I agree. Multiple-DL is something that comes and goes in the community,
and so far the "consensus" has been that data layout is hard enough as
it is. I've always been keen on having it, but not keen on making it
happen (and fixing all the bugs that will come with it). :D

Another problem we haven't even considered is where the triple will
come from and in which form. As you know, triples don't usually mean
anything without further context, and that context isn't present in
the triple or the DL. They're lowered from the front-end in snippets
of code (pack/unpack, shift/mask, pad/store/pass pointer) or thunks
(EH, default class methods).

Once it's lowered, fine, the DL should be mostly fine because everything
will be lowered anyway. But how will the user identify code from
multiple different front-ends in the same IR module? If we restrict
ourselves to a single front-end, then we'll need one front-end to rule
them all, and that would be counterproductive (and fairly limited in
scope for such a large change).

I fear the infrastructure issues around getting the code inside the
module will be more complicated (potentially intractable) than dealing
with the multi-DL module once we have it...

> I am in doubt about the "simpler" part but it's an option.

That's an understatement. :)

But I think it's important to understand why, if only to make
multiple-DL modules more appealing.

> The one disadvantage I see is that we have to change the way passes
> work in this setting versus the single module setting.

Passes will already have to change, as they can't look at every function
or every call if those belong to a different DL. Probably a simpler
change, though.

cheers,
--renato
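
To illustrate why a byte-size check alone is not enough, here is a small
sketch (hypothetical type and hypothetical target alignments, chosen only
for illustration) of the same source-level struct under two different
data layouts; even when the caller hands over the "right" number of
bytes, the field offsets the callee computes can differ.

---
// The same source-level type under two hypothetical data layouts:
//
//   Layout A (64-bit integers aligned to 8 bytes):
//     offsetof(S, y) == 8, sizeof(S) == 16
//   Layout B (64-bit integers aligned to 4 bytes):
//     offsetof(S, y) == 4, sizeof(S) == 12
//
// A check that only compares the total argument size (possibly after
// padding the smaller form to 16 bytes) still leaves `y` at different
// offsets on the two sides; without explicit pack/unpack code the
// mismatch is silent.
struct S {
  int       x;
  long long y;
};

// This translation unit assumes Layout A; the other side would need the
// mirror-image assumption plus explicit marshalling code.
static_assert(sizeof(S) == 16, "expected the Layout-A struct size");
---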
Johannes Doerfert via llvm-dev
2020-Jul-28 20:50 UTC
[llvm-dev] [RFC] Heterogeneous LLVM-IR Modules
On 7/28/20 3:03 PM, Renato Golin wrote:
> On Tue, 28 Jul 2020 at 20:44, Johannes Doerfert
> <johannesdoerfert at gmail.com> wrote:
>> What I (tried to) describe is that you can pass an array of structs via
>> a CUDA memcpy (or similar) to the device and then expect it to be
>> accessible as an array of structs on the other side. I can imagine this
>> property doesn't hold for *every* programming model, but the question is
>> if we need to support the ones that don't have it. FWIW, I don't know if
>> it is worth building up a system that can allow this property to be
>> missing or if it is better to not allow such systems to opt in to the
>> heterogeneous module merging. I guess we would need to list the
>> programming models for which you cannot reasonably expect the above to
>> work.
>
> Right, this is the can of worms I think we won't see before it hits
> us. My main concern is that allowing for opaque interfaces to be
> defined means we'll be able to do almost anything around such simple
> constraints, and the code won't be heavily tested around it (because
> it's really hard to test those constraints).
>
> For example, one constraint is: functions that cross the DL barrier
> (i.e. call functions in another DL) must marshall the arguments in a
> way that the size in bytes is exactly what the function expects, given
> its DL.
>
> This is somewhat easy to verify, but it's not enough to guarantee that
> the alignment of internal elements, structure layout, padding, etc.
> make sense on the target. Unless we write code that packs/unpacks, we
> cannot guarantee it is what we expect. And writing unpack code on a GPU
> may not even be meaningful. And it can change from one GPU family to
> another, or one API to another.
>
> Makes sense?

Kind of. I get the theoretical concern, but I am questioning if we need
to support that at all. What I'm trying to say is that for the
programming models I am aware of, this is not a concern to begin with.
The accelerator actually matches the host data layout.

Let's take OpenMP. The compiler cannot know what your memory actually is
because types are, you know, just hints for the most part. So we need
the devices to match the host data layout wrt. padding, alignment, etc.,
or we could not copy an array of structs from one to the other and
expect it to work. CUDA, HIP, SYCL, ... should all be the same. I hope
someone corrects me if I have some misconceptions here :)

>> I think that a multi-DL + multi-triple design seems like a good
>> candidate.
>
> I agree. Multiple-DL is something that comes and goes in the community,
> and so far the "consensus" has been that data layout is hard enough as
> it is. I've always been keen on having it, but not keen on making it
> happen (and fixing all the bugs that will come with it). :D
>
> Another problem we haven't even considered is where the triple will
> come from and in which form. As you know, triples don't usually mean
> anything without further context, and that context isn't present in
> the triple or the DL. They're lowered from the front-end in snippets
> of code (pack/unpack, shift/mask, pad/store/pass pointer) or thunks
> (EH, default class methods).
>
> Once it's lowered, fine, the DL should be mostly fine because
> everything will be lowered anyway. But how will the user identify code
> from multiple different front-ends in the same IR module? If we
> restrict ourselves to a single front-end, then we'll need one front-end
> to rule them all, and that would be counterproductive (and fairly
> limited in scope for such a large change).
>
> I fear the infrastructure issues around getting the code inside the
> module will be more complicated (potentially intractable) than dealing
> with the multi-DL module once we have it...
>
>> I am in doubt about the "simpler" part but it's an option.
>
> That's an understatement. :)
>
> But I think it's important to understand why, if only to make
> multiple-DL modules more appealing.

Fair. And I'm open to being convinced this is the right approach after
all.

>> The one disadvantage I see is that we have to change the way passes
>> work in this setting versus the single module setting.
>
> Passes will already have to change, as they can't look at every
> function or every call if those belong to a different DL. Probably a
> simpler change, though.

Again, I'm not so sure. As long as the interface is opaque, e.g., calls
from the host go through a driver API, I doubt there is really a
problem. I imagine it somewhat like OpenMP looks; here is my conceptual
model:

---
// Host
static char KERNEL_ID;

void offload(void *payload) {
  driver_invoke_kernel(&KERNEL_ID, payload);
}

__attribute__((callback (cb_fn, arg)))
void driver_invoke_kernel(void(*cb_fn)(void *), void *arg);

// Device
device 1 global alias KERNEL_ID = kernel;   // "LLVM-IR-like syntax"

device 1 void kernel(void *payload) {
  auto *typed_payload = static_cast<...>(payload);
  ...
}
---

The "important" part is that there is no direct call edge between the two
modules. If we want to support that, we can have that discussion for
sure. I initially assumed we might not need/want to allow that. If not,
I don't see how any interpretation of the IR could lead to a problem.
Every pass, including IPO, can go off and optimize; as long as passes
are not callback- and device-ID-aware, they will not make the connection
from `offload` to `kernel`.

Again, there might be models I don't know of that have a direct call
edge, or we may want to prepare for them. In those cases we would need
to care, for sure.

~ Johannes
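
To make the host side of that conceptual model a bit more tangible, here
is a compilable sketch (all names hypothetical; the callback attribute is
the existing Clang attribute for describing exactly this kind of broker
call). The only call edge leaving `offload` goes into the opaque driver
API, so only callback-aware analyses can connect it to the payload
handler.

---
// Stand-in for the kernel's host-side handle; in the real setting the
// device module would provide the actual kernel behind this symbol.
static void handle_payload(void *payload);

// Opaque driver entry point. The callback attribute tells callback-aware
// IPO that `cb_fn` may eventually be invoked with `arg`, even though no
// direct call to the callback appears in this module.
__attribute__((callback(cb_fn, arg)))
void driver_invoke_kernel(void (*cb_fn)(void *), void *arg);

void offload(void *payload) {
  // The only call here is into the driver; ordinary passes never see a
  // call edge from offload() to handle_payload().
  driver_invoke_kernel(handle_payload, payload);
}

static void handle_payload(void *payload) {
  int *typed_payload = (int *)payload;
  *typed_payload += 1;
}
---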