dag at cray.com
2012-Apr-30 19:55 UTC
[LLVMdev] [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation
Tobias Grosser <tobias at grosser.es> writes:

> To write optimizations that yield embedded GPU code, we also looked into
> three other approaches:
>
> 1. Directly create embedded target code (e.g. PTX)
>
> This would mean the optimization pass extracts device code internally
> and directly generates the relevant target code. This approach would
> require our generic optimization pass to be directly linked with the
> specific target back end. This is an ugly layering violation and, in
> addition, it causes major trouble in case the new optimization should
> be dynamically loaded.

IMHO it's a bit unrealistic to have a target-independent optimization
layer. Almost all optimization wants to know target details at some
point. I think we can and probably should support that. We can allow
passes to gracefully fall back in the cases where target information is
not available.

> 2. Extend the LLVM-IR files to support heterogeneous modules
>
> This would mean we extend LLVM-IR such that IR for different targets
> can be stored within a single IR file. This approach could be
> integrated nicely into the LLVM code generation flow and would yield
> readable LLVM-IR even for the device code. However, it adds another
> level of complexity to the LLVM-IR files and requires massive changes
> not only in the LLVM code base, but also in compilers built on top of
> LLVM-IR.

I don't think the code base changes are all that bad. We have a number
of them to support generating code one function at a time rather than a
whole module together. They've been sitting around waiting for us to
send them upstream. It would be an easy matter to simply annotate each
function with its target. We don't currently do that because we never
write out such IR files, but it seems like a simple problem to solve to
me.

> 3. Generate two independent LLVM-IR files and pass them around together
>
> The host and device LLVM-IR modules could be kept in separate files.
> This has the benefit of being user readable and not adding additional
> complexity to the LLVM-IR files themselves. However, separate files do
> not provide information about how those files are related. Which files
> are kernel files, how/where do they need to be loaded, ...? This
> information could probably be put into metadata or could be hard coded
> into the generic compiler infrastructure, but this would require
> significant additional code.

I don't think metadata would work because it would not satisfy the "no
semantic effects" requirement. We couldn't just drop the metadata and
expect things to work.

> Another weakness of this approach is that the entire LLVM optimization
> chain is currently built under the assumption that a single file/module
> is passed around. This is most obvious with the 'opt | llc' idiom, but
> in general every tool that currently exists would need to be adapted to
> handle multiple files and would possibly even need semantic knowledge
> about how to connect/use them together. Just running clang or dragonegg
> with -load GPGPUOptimizer.so would not be possible.

Again, we have many of the changes to make this possible. I hope to
send them for review as we upgrade to 3.1.

> All of the previous approaches require significant changes all over the
> code base and would cause trouble with loadable optimization passes.
> The intrinsic-based approach seems to address most of the previous
> problems.

I'm pretty uncomfortable with the proposed intrinsic. It feels
tacked-on and not in the LLVM spirit. We should be able to extend the
IR to support multiple targets. We're going to need this kind of
support for much more than GPUs in the future. Heterogeneous computing
is here to stay.

                                -Dave
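[Editor's note: for readers skimming the archive, the intrinsic being debated works roughly as follows: the device kernel travels inside the host module as a plain LLVM-IR string, and a call to llvm.codegen asks the back end named by the second operand to lower that string at host-code-generation time, yielding (for NVIDIA) a PTX string. The sketch below paraphrases the shape of such a call; the exact signature, operand types, and array sizes are illustrative, not the RFC's final form.]

```llvm
; Hedged sketch only: signature and operands paraphrased, not authoritative.
; The kernel module is carried as LLVM-IR text inside the host module.
@kernel_ir = private constant [64 x i8] c"define void @kern(float* %p) { ... }\00"
@arch      = private constant [4 x i8] c"ptx\00"

declare i8* @llvm.codegen(i8* %ir, i8* %arch)

define i8* @get_ptx() {
  ; During host code generation this call would be replaced by a pointer
  ; to the PTX assembly string produced for the embedded module.
  %ptx = call i8* @llvm.codegen(
             i8* getelementptr ([64 x i8]* @kernel_ir, i32 0, i32 0),
             i8* getelementptr ([4 x i8]* @arch, i32 0, i32 0))
  ret i8* %ptx
}
```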
dag at cray.com
2012-Apr-30 20:03 UTC
[LLVMdev] [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation
<dag at cray.com> writes:

> Tobias Grosser <tobias at grosser.es> writes:
>
>> To write optimizations that yield embedded GPU code, we also looked into
>> three other approaches:
>>
>> 1. Directly create embedded target code (e.g. PTX)
>>
>> This would mean the optimization pass extracts device code internally
>> and directly generates the relevant target code. This approach would
>> require our generic optimization pass to be directly linked with the
>> specific target back end. This is an ugly layering violation and, in
>> addition, it causes major trouble in case the new optimization should
>> be dynamically loaded.
>
> IMHO it's a bit unrealistic to have a target-independent optimization
> layer. Almost all optimization wants to know target details at some
> point. I think we can and probably should support that. We can allow
> passes to gracefully fall back in the cases where target information is
> not available.

I think I misread your intent here. It is indeed a very bad layering
violation to have opt generate code. In the response above I am talking
about making target characteristics available to opt passes if they are
available. I think the latter is important to get good performance.

                                -Dave
Justin Holewinski
2012-May-01 04:32 UTC
[LLVMdev] [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation
On Mon, Apr 30, 2012 at 12:55 PM, <dag at cray.com> wrote:

> Tobias Grosser <tobias at grosser.es> writes:
>
> > To write optimizations that yield embedded GPU code, we also looked into
> > three other approaches:
> >
> > 1. Directly create embedded target code (e.g. PTX)
> >
> > This would mean the optimization pass extracts device code internally
> > and directly generates the relevant target code. This approach would
> > require our generic optimization pass to be directly linked with the
> > specific target back end. This is an ugly layering violation and, in
> > addition, it causes major trouble in case the new optimization should
> > be dynamically loaded.
>
> IMHO it's a bit unrealistic to have a target-independent optimization
> layer. Almost all optimization wants to know target details at some
> point. I think we can and probably should support that. We can allow
> passes to gracefully fall back in the cases where target information is
> not available.
>
> > 2. Extend the LLVM-IR files to support heterogeneous modules
> >
> > This would mean we extend LLVM-IR such that IR for different targets
> > can be stored within a single IR file. This approach could be
> > integrated nicely into the LLVM code generation flow and would yield
> > readable LLVM-IR even for the device code. However, it adds another
> > level of complexity to the LLVM-IR files and requires massive changes
> > not only in the LLVM code base, but also in compilers built on top of
> > LLVM-IR.
>
> I don't think the code base changes are all that bad. We have a number
> of them to support generating code one function at a time rather than a
> whole module together. They've been sitting around waiting for us to
> send them upstream. It would be an easy matter to simply annotate each
> function with its target. We don't currently do that because we never
> write out such IR files, but it seems like a simple problem to solve to
> me.

If such changes are almost ready to be upstreamed, then great! It just
seems like a fairly non-trivial task to actually implement
function-level target selection, especially when you consider function
call semantics, taking the address of a function, etc. If you have a
global variable, what target "sees" it? Does it need to be annotated
along with the function? Can functions from two different targets share
this pointer? At first glance, there seem to be many non-trivial issues
that are heavily dependent on the nature of the target.

For Yabin's use-case, the X86 portions need to be compiled to assembly,
or even an object file, while the PTX portions need to be lowered to an
assembly string and embedded in the X86 source (or written to disk
somewhere). If you're targeting Cell, in contrast, you'd want to
compile both down to object files.

Don't get me wrong, I think this is something we need to do and the
llvm.codegen intrinsic is a band-aid solution, but I don't see this as
a simple problem.

> > 3. Generate two independent LLVM-IR files and pass them around together
> >
> > The host and device LLVM-IR modules could be kept in separate files.
> > This has the benefit of being user readable and not adding additional
> > complexity to the LLVM-IR files themselves. However, separate files do
> > not provide information about how those files are related. Which files
> > are kernel files, how/where do they need to be loaded, ...? This
> > information could probably be put into metadata or could be hard coded
> > into the generic compiler infrastructure, but this would require
> > significant additional code.
>
> I don't think metadata would work because it would not satisfy the "no
> semantic effects" requirement. We couldn't just drop the metadata and
> expect things to work.
>
> > Another weakness of this approach is that the entire LLVM optimization
> > chain is currently built under the assumption that a single file/module
> > is passed around. This is most obvious with the 'opt | llc' idiom, but
> > in general every tool that currently exists would need to be adapted to
> > handle multiple files and would possibly even need semantic knowledge
> > about how to connect/use them together. Just running clang or dragonegg
> > with -load GPGPUOptimizer.so would not be possible.
>
> Again, we have many of the changes to make this possible. I hope to
> send them for review as we upgrade to 3.1.
>
> > All of the previous approaches require significant changes all over the
> > code base and would cause trouble with loadable optimization passes.
> > The intrinsic-based approach seems to address most of the previous
> > problems.
>
> I'm pretty uncomfortable with the proposed intrinsic. It feels
> tacked-on and not in the LLVM spirit. We should be able to extend the
> IR to support multiple targets. We're going to need this kind of
> support for much more than GPUs in the future. Heterogeneous computing
> is here to stay.

For me, the bigger question is: do we extend the IR to support multiple
targets, or do we keep the one-target-per-module philosophy and derive
some other way of representing how the modules fit together? I can see
pros and cons for both approaches.

What if instead of per-function annotations, we implement something
like module file sections? You could organize a module file into
logical sections based on target architecture. I'm just throwing that
out there.

> -Dave

--
Thanks,

Justin Holewinski
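[Editor's note: to make the questions above concrete, here is what a per-function target annotation of the sort dag describes might look like, alongside the global-variable question Justin raises. This is invented syntax throughout; nothing here exists in LLVM-IR.]

```llvm
; Invented syntax: LLVM-IR has no per-function target annotation today.
; The sketch only illustrates the open questions.

; Which target "sees" this global? Does it need its own annotation, and
; may functions compiled for two different targets take its address?
@shared = global i32 0                          ; annotated how?

define void @host_entry() target "x86_64" {     ; hypothetical attribute
  ret void
}

define void @kern(i32* %p) target "ptx64" {     ; hypothetical attribute
  ret void
}
```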
dag at cray.com
2012-May-01 15:22 UTC
[LLVMdev] [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation
Justin Holewinski <justin.holewinski at gmail.com> writes:

>> I don't think the code base changes are all that bad. We have a number
>> of them to support generating code one function at a time rather than a
>> whole module together. They've been sitting around waiting for us to
>> send them upstream. It would be an easy matter to simply annotate each
>> function with its target. We don't currently do that because we never
>> write out such IR files, but it seems like a simple problem to solve to
>> me.
>
> If such changes are almost ready to be upstreamed, then great!

Just to clarify, the current changes simply allow a function to be
completely processed (including asm generation) before the next
function is sent to codegen.

> It just seems like a fairly non-trivial task to actually implement
> function-level target selection, especially when you consider function
> call semantics, taking the address of a function, etc.

For something like PTX, runtime calls take care of the call semantics,
so it is either up to the user or the frontend to set up the runtime
calls correctly. We don't need to completely solve this problem. Yet. :)

> If you have a global variable, what target "sees" it? Does it need to
> be annotated along with the function?

For a tool like llc, wouldn't it simply be a matter of changing
TheTarget and reconstituting the various passes? The changes we have
waiting to upstream already allow us to reconstitute passes. I
sometimes use this to turn debugging on/off on a function-level basis.
The way we've constructed our backend interface should allow us to just
switch the target and reinitialize everything. I'm sure I'm glossing
over tons of details, but I don't see a fundamental architectural
problem in LLVM that would prevent this.

> Can functions from two different targets share this pointer?

Again, in the case of PTX it's the runtime's responsibility to ensure
this. I agree passing pointers around complicates things in the general
case, but I also think it's a solvable problem.

> For Yabin's use-case, the X86 portions need to be compiled to
> assembly, or even an object file, while the PTX portions need to be
> lowered to an assembly string and embedded in the X86 source (or
> written to disk somewhere).

I think it's just a matter of switching to a different AsmWriter. The
PTX runtime can load objects from files. The code doesn't have to be a
string in the x86 object file.

> If you're targeting Cell, in contrast, you'd want to compile both down
> to object files.

I think we probably want to do that for PTX as well.

> For me, the bigger question is: do we extend the IR to support
> multiple targets, or do we keep the one-target-per-module philosophy
> and derive some other way of representing how the modules fit
> together? I can see pros and cons for both approaches.

Me too.

> What if instead of per-function annotations, we implement something
> like module file sections? You could organize a module file into
> logical sections based on target architecture. I'm just throwing that
> out there.

Do we allow more than one Module per file? If not, that seems like an
arbitrary limitation. If we allowed that, we could have each module
specify a different target.

                                -Dave
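[Editor's note: dag's closing question can be illustrated with invented syntax. If an LLVM-IR file could carry more than one module, each module could keep its own target triple, preserving the one-target-per-module philosophy while still shipping host and device code together. The `module { ... }` grouping below does not exist; it is purely a sketch of the idea.]

```llvm
; Invented syntax: today an LLVM-IR file holds exactly one module.
module host {
  target triple = "x86_64-unknown-linux-gnu"
  define i32 @main() {
    ret i32 0
  }
}

module device {
  target triple = "ptx64--"
  define void @kern(float* %p) {
    ret void
  }
}
```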
Tobias Grosser
2012-May-07 08:47 UTC
[LLVMdev] [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation
On 04/30/2012 09:55 PM, dag at cray.com wrote:

> Tobias Grosser <tobias at grosser.es> writes:
>
>> To write optimizations that yield embedded GPU code, we also looked into
>> three other approaches:
>>
>> 1. Directly create embedded target code (e.g. PTX)
>>
>> This would mean the optimization pass extracts device code internally
>> and directly generates the relevant target code. This approach would
>> require our generic optimization pass to be directly linked with the
>> specific target back end. This is an ugly layering violation and, in
>> addition, it causes major trouble in case the new optimization should
>> be dynamically loaded.
>
> IMHO it's a bit unrealistic to have a target-independent optimization
> layer. Almost all optimization wants to know target details at some
> point. I think we can and probably should support that. We can allow
> passes to gracefully fall back in the cases where target information is
> not available.

Yes, I agree it makes sense to make target information available to the
optimizers. As you noted yourself, this is different from performing
target code generation in the optimizers.

>> 2. Extend the LLVM-IR files to support heterogeneous modules
>>
>> This would mean we extend LLVM-IR such that IR for different targets
>> can be stored within a single IR file. This approach could be
>> integrated nicely into the LLVM code generation flow and would yield
>> readable LLVM-IR even for the device code. However, it adds another
>> level of complexity to the LLVM-IR files and requires massive changes
>> not only in the LLVM code base, but also in compilers built on top of
>> LLVM-IR.
>
> I don't think the code base changes are all that bad. We have a number
> of them to support generating code one function at a time rather than a
> whole module together. They've been sitting around waiting for us to
> send them upstream. It would be an easy matter to simply annotate each
> function with its target. We don't currently do that because we never
> write out such IR files, but it seems like a simple problem to solve to
> me.

Supporting several modules in one LLVM-IR file may not be too
difficult, but getting this in may still be controversial. The large
amount of changes that I see are changes to the tools. At the moment
all tools expect a single module coming from an LLVM-IR file. I pointed
out the problems in llc and the codegen examples in my other mail.

>> 3. Generate two independent LLVM-IR files and pass them around together
>>
>> The host and device LLVM-IR modules could be kept in separate files.
>> This has the benefit of being user readable and not adding additional
>> complexity to the LLVM-IR files themselves. However, separate files do
>> not provide information about how those files are related. Which files
>> are kernel files, how/where do they need to be loaded, ...? This
>> information could probably be put into metadata or could be hard coded
>> into the generic compiler infrastructure, but this would require
>> significant additional code.
>
> I don't think metadata would work because it would not satisfy the "no
> semantic effects" requirement. We couldn't just drop the metadata and
> expect things to work.

You are right, this solution requires semantic metadata, which is a
non-trivial prerequisite.

>> Another weakness of this approach is that the entire LLVM optimization
>> chain is currently built under the assumption that a single file/module
>> is passed around. This is most obvious with the 'opt | llc' idiom, but
>> in general every tool that currently exists would need to be adapted to
>> handle multiple files and would possibly even need semantic knowledge
>> about how to connect/use them together. Just running clang or dragonegg
>> with -load GPGPUOptimizer.so would not be possible.
>
> Again, we have many of the changes to make this possible. I hope to
> send them for review as we upgrade to 3.1.

Could you provide a list of the changes you have in the pipeline and a
reliable timeline for when you will upstream them? How much additional
work from other people is required to make this a valuable replacement
for the llvm.codegen intrinsic?

>> All of the previous approaches require significant changes all over the
>> code base and would cause trouble with loadable optimization passes.
>> The intrinsic-based approach seems to address most of the previous
>> problems.
>
> I'm pretty uncomfortable with the proposed intrinsic. It feels
> tacked-on and not in the LLVM spirit. We should be able to extend the
> IR to support multiple targets. We're going to need this kind of
> support for much more than GPUs in the future. Heterogeneous computing
> is here to stay.

Where exactly do you see problems with this intrinsic? It is not meant
to block further work in heterogeneous computing, but to allow us to
gradually improve LLVM to gain such features. It especially provides a
low-overhead solution that adds working heterogeneous compute
capabilities for major GPU targets to LLVM. This working solution can
prepare the ground for more closely integrated solutions.

Tobi
dag at cray.com
2012-May-07 16:24 UTC
[LLVMdev] [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation
Tobias Grosser <tobias at grosser.es> writes:

> Supporting several modules in one LLVM-IR file may not be too
> difficult, but getting this in may still be controversial. The large
> amount of changes that I see are changes to the tools. At the moment
> all tools expect a single module coming from an LLVM-IR file. I pointed
> out the problems in llc and the codegen examples in my other mail.

I replied to that mail, so I won't repeat it all here. I don't think
there's any problem given current technology. Since I don't know any
details (only speculation) about what's coming in the future, I can't
comment beyond that.

>> Again, we have many of the changes to make this possible. I hope to
>> send them for review as we upgrade to 3.1.
>
> Could you provide a list of the changes you have in the pipeline and a
> reliable timeline for when you will upstream them? How much additional
> work from other people is required to make this a valuable replacement
> for the llvm.codegen intrinsic?

I'll try to recall the major bits. I did this work 3-4 years ago...

I think the major issue was with the AsmPrinter. There's global state
kept around that needs to be cleared between invocations. The
initialization step needs to be re-run for each function, but there are
some tricky bits that should not happen on each run. That is, most of
AsmPrinter is idempotent, but not all.

Label names are a big issue. A simple label counter (L0, L1, etc.) is
no longer sufficient because the counter gets reset between invocations
and you end up with multiple labels with the same name in the .s file.
We got around this by including the (mangled) function name in the
label name. I had to tweak the mangling code a bit so that it would
generate valid label names. I also consolidated it, as there were at
least two different implementations in the ~2.5 codebase. I don't know
if that's changed.

We don't use much of opt at all. I'm sure there are some issues with
the interprocedural optimizations. We didn't deal with those. All of
our changes are in the llc/codegen piece.

As for getting it upstream, we're moving to 3.1 as soon as it's ready,
and my intention is to push as much of our customized code upstream as
possible during that transition. The above work would be a pretty high
priority, as it is a major source of conflicts for us and I'd rather
just get rid of those. :) So expect to start seeing something within
1-2 months. Unfortunately, we have bureaucratic processes I have to go
through here to get stuff approved for public release.

> Where exactly do you see problems with this intrinsic? It is not meant
> to block further work in heterogeneous computing, but to allow us to
> gradually improve LLVM to gain such features. It especially provides a
> low-overhead solution that adds working heterogeneous compute
> capabilities for major GPU targets to LLVM. This working solution can
> prepare the ground for more closely integrated solutions.

It feels like a code generator bolted onto the side of opt, llc, etc.,
with all of the details that involves. It seems much easier to me to
just go through the "real" code generator.

                                -Dave
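[Editor's note: the label-collision problem dag describes can be illustrated with the sort of .s output involved; the label scheme below is illustrative, not the exact one used.]

```asm
# With one codegen invocation per module, a global counter keeps labels
# unique within the .s file: foo gets .L0/.L1, bar continues at .L2.
# Re-running codegen per function resets that counter, so foo and bar
# would each emit their own ".L0" into the same file. Embedding the
# mangled function name restores uniqueness:
.L_foo_0:            # was .L0 inside foo
        nop
.L_bar_0:            # was .L0 inside bar; no longer clashes with foo's
        nop
```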