thr3ads.net - llvm dev - [LLVMdev] [PROPOSAL] LLVM multi-module support [Jul 2012]


Tobias Grosser

2012-Jul-26 06:35 UTC

[LLVMdev] [PROPOSAL] LLVM multi-module support

Hi,

A couple of weeks ago I discussed with Peter how to improve LLVM's
support for heterogeneous computing. One weakness we (and others) have
seen is the absence of multi-module support in LLVM. Peter came up with
a nice idea for how to address this, and I would like to put it up for
discussion.

## The problem ##

LLVM-IR modules can currently only contain code for a single target 
architecture. However, there are multiple use cases where one 
translation unit could contain code for several architectures.

1) CUDA

CUDA source files can contain both host and device code. The absence of
multi-module support complicates adding CUDA support to clang, as clang
would need to perform multi-module compilation on top of a
single-module-based compiler framework.

2) C++ AMP

C++ AMP [1] contains, similarly to CUDA, both host code and device
code in the same source file. Even though C++ AMP is a Microsoft
extension, the use case itself is relevant to clang. It would be great if
LLVM provided infrastructure such that front-ends could easily target
accelerators. This would probably yield a lot of interesting experiments.

3) Optimizers

To offload computations to an accelerator fully automatically, an
optimization pass needs to extract the computation kernels and schedule
them as separate kernels on the device. Such kernels are normally
LLVM-IR modules for different architectures. At the moment, passes have
no way to create and store new LLVM-IR modules. There is also no way
to reference kernel LLVM-IR modules from a host module (which is
necessary to pass them to the accelerator run-time).

## Goals ##

a) No major changes to existing tools and LLVM based applications

b) Human readable and writable LLVM-IR

c) FileCheck testability

d) Do not force a specific execution model

e) Unlimited number of embedded modules

## Detailed Goals ##

a)
  o No changes should be required if a tool does not use multi-module
    support. Each LLVM-IR file valid today should remain valid.

  o Major tools should support basic heterogeneous modules without large
    changes. Some of the commands that should work after smaller
    adaptations:

    clang -S -emit-llvm -o out.ll
    opt -O3 out.ll -o out.opt.ll
    llc out.opt.ll
    lli out.opt.ll
    bugpoint -O3 out.opt.ll

b) All (sub)modules should be directly human readable/writable.
    There should be no need to extract single modules before modifying
    them.

c) The LLVM-IR generated from a heterogeneous multi-module should
    easily be 'FileCheck'able. The same is true if a multi-module is
    the result of an optimization.

d) In CUDA/OpenCL/C++ AMP, kernels are scheduled from within the host
    code. This means arbitrary host code can decide under which
    conditions kernels are scheduled for execution. It is therefore
    necessary to reference individual sub-modules from within the host
    module.

e) CUDA/OpenCL allow compiling and scheduling an arbitrary number of
    kernels. We do not want to put an artificial limit on the number of
    modules they are represented in. This means a single embedded
    submodule is not enough.
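
To make goal c) concrete, here is a minimal Python sketch of the kind of
in-order matching FileCheck performs on its 'CHECK:' lines. It is a
simplified model, not FileCheck itself (plain substring matching only;
no regular expressions, CHECK-NEXT or CHECK-NOT), and the helper name
check_in_order is an assumption made for illustration. The point is only
that an embedded sub-module kept as plain text can be matched in the
same pass as the host module.

```python
def check_in_order(output: str, check_file: str) -> bool:
    """Return True if every 'CHECK:' pattern occurs in `output`, in order.

    Simplified model of FileCheck: each pattern is a plain substring and
    must be found after the end of the previous match.
    """
    pos = 0
    for line in check_file.splitlines():
        marker = line.find("CHECK:")
        if marker < 0:
            continue  # not a check line
        pattern = line[marker + len("CHECK:"):].strip()
        found = output.find(pattern, pos)
        if found < 0:
            return False
        pos = found + len(pattern)
    return True


# Text in the syntax proposed in this mail (hypothetical, not valid LLVM-IR today).
multi_module = '''target triple = "x86_64-unknown-linux-gnu"
@llvm_kernel = private unnamed_addr constant llvm_kernel {
   target triple = nvptx64-unknown-unknown
   define internal ptx_kernel void @gpu_kernel(i8* %Array) {
   }
}
'''

checks = '''; CHECK: target triple = "x86_64-unknown-linux-gnu"
; CHECK: @llvm_kernel
; CHECK: define internal ptx_kernel void @gpu_kernel
'''

print(check_in_order(multi_module, checks))
```

Because host and device code live in one text file, a single set of
check lines can cover both.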

## Non Goals ##

o Modeling sub-architectures on a per-function basis

Functions could be specialized for a certain sub-architecture. This is
useful, e.g., to optimize certain functions with AVX2 enabled while the
rest of the program is compiled for a more generic architecture.
We do not address per-function annotations in this proposal.

## Proposed solution ##

To bring multi-module support to LLVM, we propose to add a new type 
called 'llvmir' to LLVM-IR. It can be used to embed LLVM-IR submodules
as global variables.

------------------------------------------------------------------------
target datalayout = ...
target triple = "x86_64-unknown-linux-gnu"

@llvm_kernel = private unnamed_addr constant llvm_kernel {
   target triple = nvptx64-unknown-unknown
   define internal ptx_kernel void @gpu_kernel(i8* %Array) {
     ...
   }
}
------------------------------------------------------------------------

By default the global will be compiled to an LLVM string stored in the
object file. We could also think about translating it to PTX or AMD's
HSA-IL, such that e.g. PTX can be passed to a run-time library.
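
As an illustration of the "string stored in the object file" default,
the following Python sketch renders a sub-module's textual IR as an
ordinary LLVM-IR string constant, which is what a host module can
already hold today. The helper name llvm_string_global is hypothetical
and not part of the proposal; it only demonstrates the c"..." escaping
rules (quote, backslash and non-printable bytes become two-digit hex
escapes).

```python
def llvm_string_global(name: str, text: str) -> str:
    """Render `text` as a NUL-terminated LLVM-IR string constant global."""
    data = text.encode("utf-8") + b"\x00"
    escaped = []
    for byte in data:
        ch = chr(byte)
        # Printable ASCII is emitted as-is, except for the double quote
        # and the backslash, which have to be written as hex escapes.
        if 0x20 <= byte < 0x7F and ch not in '"\\':
            escaped.append(ch)
        else:
            escaped.append("\\%02X" % byte)
    return '@%s = private unnamed_addr constant [%d x i8] c"%s"' % (
        name, len(data), "".join(escaped))


# Hypothetical kernel text; a real sub-module would be much longer.
kernel_ir = 'target triple = "nvptx64-unknown-unknown"\n'
print(llvm_string_global("llvm_kernel", kernel_ir))
```

A run-time library would then receive a pointer to this constant and
hand the text to whichever device compiler it uses.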

From my point of view, Peter's idea allows us to add multi-module
support in a way that allows us to reach the goals described above.
However, to properly design and implement it, early feedback would be
valuable.

Cheers
Tobi

[1] http://msdn.microsoft.com/en-us/library/hh265137%28v=vs.110%29
[2] http://www.amd.com/us/press-releases/Pages/amd-arm-computing-innovation-2012june12.aspx

Duncan Sands

2012-Jul-26 07:19 UTC

Hi Tobias, I didn't really get it.  Is the idea that the same bitcode is
going to be codegen'd for different architectures, or is each sub-module
going to contain different bitcode?  In the latter case you may as well
just use multiple modules, perhaps in conjunction with a scheme to store
more than one module in the same file on disk as a convenience.
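
A minimal sketch of such a "several modules in one file" scheme, using a
plain tar archive from Python's standard library. The file names host.ll
and kernel.ll and the helper names are assumptions chosen for
illustration; no LLVM tool is claimed to read this layout.

```python
import io
import tarfile


def pack_modules(modules: dict) -> bytes:
    """Pack {file name: LLVM-IR text} into one in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, text in modules.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack_modules(archive: bytes) -> dict:
    """Inverse of pack_modules: recover {file name: LLVM-IR text}."""
    modules = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        for member in tar.getmembers():
            modules[member.name] = tar.extractfile(member).read().decode("utf-8")
    return modules


# One module per target, each an ordinary single-target LLVM-IR file.
modules = {
    "host.ll": 'target triple = "x86_64-unknown-linux-gnu"\n',
    "kernel.ll": 'target triple = "nvptx64-unknown-unknown"\n',
}
archive = pack_modules(modules)
print(sorted(unpack_modules(archive)))
```

Each member stays an independent module, so existing tools could be run
on the extracted files unchanged.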

Ciao, Duncan.

Dmitry N. Mikushin

2012-Jul-26 10:42 UTC

In our project we combine regular binary code with LLVM IR code for
kernels, embedded as a special data symbol of the ELF object. The kernel
LLVM IR that exists at compile time is preliminary and may be optimized
further at runtime (pointer analysis, Polly, etc.). During application
startup, the runtime system builds an index of all kernel sources
embedded in the executable. Host and kernel code interact by means of a
special "launch" call, which not only optimizes, compiles and executes
the kernel, but first estimates whether doing so is worthwhile or
whether it is better to fall back to the equivalent host code.
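
The control flow described above can be sketched in Python as follows.
All names (KERNEL_INDEX, register_kernel, worth_offloading, launch) and
the size-threshold cost model are assumptions made for illustration, not
the project's actual API; the device path is a stand-in that simply
reuses the host implementation.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Kernel(NamedTuple):
    ir: str                                      # LLVM IR text embedded at compile time
    host_fallback: Callable[[List[int]], List[int]]  # equivalent host implementation


# Index of all embedded kernels, built once at application startup.
KERNEL_INDEX: Dict[str, Kernel] = {}


def register_kernel(name: str, ir: str, host_fallback) -> None:
    KERNEL_INDEX[name] = Kernel(ir, host_fallback)


def worth_offloading(data: List[int], threshold: int = 1024) -> bool:
    # Placeholder cost model: only large inputs are assumed to amortize
    # the optimize + compile + transfer overhead.
    return len(data) >= threshold


def run_on_device(kernel: Kernel, data: List[int]) -> List[int]:
    # Stand-in for optimize + compile + execute on the accelerator.
    return kernel.host_fallback(data)


def launch(name: str, data: List[int]) -> Tuple[str, List[int]]:
    """Estimate first, then either offload or fall back to host code."""
    kernel = KERNEL_INDEX[name]
    if worth_offloading(data):
        return "device", run_on_device(kernel, data)
    return "host", kernel.host_fallback(data)


register_kernel("double", "; kernel IR elided", lambda xs: [2 * x for x in xs])
print(launch("double", [1, 2, 3]))
```

The decision is taken per call, which is why the host equivalent always
has to exist even when the specialized code does not yet.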

The proposal made by Tobias is very elegant, but it seems to address the
case where host and sub-architecture code exist at the same time. May I
kindly point out that, in our experience, really efficient, deeply
specialized sub-architecture code may simply not exist at compile time,
while the generic baseline host code always can.

Best,
- Dima.


dag at cray.com

2012-Jul-26 15:09 UTC

Tobias Grosser <tobias at grosser.es> writes:
> o Modeling sub-architectures on a per-function basis
>
> Functions could be specialized for a certain sub-architecture. This is 
> helpful to have certain functions optimized e.g. with AVX2 enabled, but 
> the general program being compiled for a more generic architecture.
> We do not address per-function annotations in this proposal.

Could this be accomplished using a separate module for the specialized
function of interest under your proposal?

> ## Proposed solution ##
>
> To bring multi-module support to LLVM, we propose to add a new type
> called 'llvmir' to LLVM-IR. It can be used to embed LLVM-IR submodules
> as global variables.
>
> ------------------------------------------------------------------------
> target datalayout = ...
> target triple = "x86_64-unknown-linux-gnu"
>
> @llvm_kernel = private unnamed_addr constant llvm_kernel {
>    target triple = nvptx64-unknown-unknown
>    define internal ptx_kernel void @gpu_kernel(i8* %Array) {
>      ...
>    }
> }
> ------------------------------------------------------------------------
>
> By default the global will be compiled to an LLVM string stored in the
> object file. We could also think about translating it to PTX or AMD's
> HSA-IL, such that e.g. PTX can be passed to a run-time library.

Hmm...I'm not sure about this model.  Not every accelerator execution
model out there takes code as a string.  Some want natively-compiled
binaries.

> From my point of view, Peter's idea allows us to add multi-module
> support in a way that allows us to reach the goals described above.
> However, to properly design and implement it, early feedback would be
> valuable.

I really don't like this at first glance.  Anything that results in a
string means that we can't use normal tools to manipulate it.  I
understand the string representation is desirable for some targets but
it seems to really cripple others.  The object file output should at
least be configurable.  Some targets might even want separate asm files
for the various architectures.

                                 -Dave

dag at cray.com

2012-Jul-26 15:13 UTC

Duncan Sands <baldrick at free.fr> writes:
> Hi Tobias, I didn't really get it.  Is the idea that the same bitcode is
> going to be codegen'd for different architectures, or is each sub-module
> going to contain different bitcode?  In the latter case you may as well
> just use multiple modules, perhaps in conjunction with a scheme to store
> more than one module in the same file on disk as a convenience.

I tend to agree.  Why do we need a whole new submodule concept?

                              -Dave

Tobias Grosser

2012-Jul-29 19:16 UTC

On 07/26/2012 12:49 PM, Duncan Sands wrote:
> Hi Tobias, I didn't really get it. Is the idea that the same bitcode is
> going to be codegen'd for different architectures, or is each sub-module
> going to contain different bitcode? In the latter case you may as well
> just use multiple modules, perhaps in conjunction with a scheme to store
> more than one module in the same file on disk as a convenience.

Hi Duncan,

thanks for your reply.

The proposal may allow both, sub-modules that contain different bitcode,
but also sub-modules that are code generated differently.

Different bitcode may arise from sub-modules that represent different
program parts, but also because we want to create different sub-modules
for a single program part e.g. to optimize for specific hardware.

In the back-end, sub-modules could be code generated according to the
requirements of the run-time system that will load them. For NVIDIA
chips we could code generate PTX, for AMD systems AMD-IL may be an option.

You and several others (e.g. Justin) pointed out that multi-modules in
LLVM-IR (or the llvm.codegen intrinsics) just reinvent the tar archive
system. I can follow your thoughts here.

Thinking of how to add CUDA support to clang, a possible approach is
to modify clang to emit device and host code to different modules,
compile each module separately and then add logic to clang to merge the
two modules in the end. This is a very reasonable approach, and there is
no doubt that adding multi-module support to LLVM just to simplify this
single use case is not the right thing to do.

With multi-module support I am aiming for something else. As you know,
LLVM allows optimizer plugins to be loaded at run-time with "-load", and
every LLVM-based compiler, be it clang/ghc/dragonegg/lli/..., can take
advantage of them with almost no source code changes. I believe this is
a very nice feature, as it allows new optimizations to be prototyped and
tested easily and without any changes to the core compilers themselves.
This works not only for simple IR transformations; even
autoparallelisation works well, as calls to libgomp can easily be added.

The next step we were looking into was automatically offloading some
calculations to an accelerator. This is actually very similar to OpenMP
parallelisation, but, instead of calls to libgomp, calls to libcuda or
libopencl need to be scheduled. The only major difference is that the
kernel code is not just a simple function in the host module, but an
entirely new module. Hence an optimizer somehow needs to extract those
modules and needs to pass a reference to them to the CUDA or OpenCL
runtime.

The driving motivation for my proposal was to extend LLVM such that
optimization passes for heterogeneous architectures can be run in
LLVM-based compilers with little or no change to the compiler source
code. I think having this functionality will allow people to test new
ideas more easily and will avoid the need for each project to create its
own tool chain. It will also allow one optimizer to work with most tools
(clang/ghc/dragonegg/lli) without the need for larger changes.

From the discussion about our last proposal, the llvm.codegen()
intrinsic, I drew the conclusion that people are mostly concerned about
interpreting arbitrary strings embedded into an LLVM-IR file and that
people suggested explicit LLVM-IR extensions as one possible solution.
So I was hoping this proposal could address some of the previously
raised concerns. However, apparently people do not really see a need for
stronger support for heterogeneous compilation directly within LLVM. Or,
the other way around, I fail to see how to achieve the same goals with
the existing infrastructure or some of the suggestions people made. I
will probably need to better understand some of the ideas pointed out.

Thanks again for your feedback

Cheers
Tobi

Tobias Grosser

2012-Jul-29 19:38 UTC

On 07/26/2012 08:39 PM, dag at cray.com wrote:
> Tobias Grosser <tobias at grosser.es> writes:
>
>> o Modeling sub-architectures on a per-function basis
>>
>> Functions could be specialized for a certain sub-architecture. This is
>> helpful to have certain functions optimized e.g. with AVX2 enabled, but
>> the general program being compiled for a more generic architecture.
>> We do not address per-function annotations in this proposal.
>
> Could this be accomplished using a separate module for the specialized
> function of interest under your proposal?

In my proposal, different modules have different address spaces. Also, I
don't aim to support function calls across module boundaries. So having
a separate module for this function does not seem to be a solution.

>> ## Proposed solution ##
>>
>> To bring multi-module support to LLVM, we propose to add a new type
>> called 'llvmir' to LLVM-IR. It can be used to embed LLVM-IR submodules
>> as global variables.
>>
>> ------------------------------------------------------------------------
>> target datalayout = ...
>> target triple = "x86_64-unknown-linux-gnu"
>>
>> @llvm_kernel = private unnamed_addr constant llvm_kernel {
>>     target triple = nvptx64-unknown-unknown
>>     define internal ptx_kernel void @gpu_kernel(i8* %Array) {
>>       ...
>>     }
>> }
>> ------------------------------------------------------------------------
>>
>> By default the global will be compiled to an LLVM string stored in the
>> object file. We could also think about translating it to PTX or AMD's
>> HSA-IL, such that e.g. PTX can be passed to a run-time library.
>
> Hmm...I'm not sure about this model.  Not every accelerator execution
> model out there takes code as a string.  Some want natively-compiled
> binaries.

If LLVM provides an object code emitter for the relevant back-end, we
could also think about emitting native binaries. Storing the assembly as
a string is just a 'default' output.

>> From my point of view, Peter's idea allows us to add multi-module
>> support in a way that allows us to reach the goals described above.
>> However, to properly design and implement it, early feedback would be
>> valuable.
>
> I really don't like this at first glance.  Anything that results in a
> string means that we can't use normal tools to manipulate it.  I
> understand the string representation is desirable for some targets but
> it seems to really cripple others.  The object file output should at
> least be configurable.  Some targets might even want separate asm files
> for the various architectures.

I see 'string' just as a default output, but would have hoped we could
provide other outputs as needed. Do you see any reason why we could not
emit native code for some of the embedded sub-modules?

Thanks for your feedback
Tobi
