thr3ads.net - llvm dev - [LLVMdev] whole program optimization examples? [Oct 2014]

Hayden Livingston

2014-Oct-11 01:24 UTC

[LLVMdev] whole program optimization examples?

Hello,

I was wondering if there is an example list somewhere of whole program
optimizations done by LLVM-based compilers?

I'm only familiar with method-level optimizations, and I'm being told
WPO can deliver many great speedups.

My language is currently statically typed, JIT based, and uses the JVM, and I
want to move it over to LLVM so that I can have options where it can be
ahead-of-time compiled as well.

I'm hearing bad things about LLVM's JIT capabilities -- specifically that
writing your own GC is going to be a pain.

Anyways, sort of diverged there, but still looking for WPO examples!

Hayden.

Renato Golin

2014-Oct-11 15:19 UTC

On 11 October 2014 02:24, Hayden Livingston <halivingston at gmail.com> wrote:
> I'm hearing bad things about LLVM's JIT capabilities -- specifically that
> writing your own GC is going to be a pain.

While VMKit is retired, there are good examples of how to integrate
VMs and GCs into LLVM.

http://vmkit.llvm.org/

I'll let others chime in on the WPO examples. :)

cheers,
--renato

Philip Reames

2014-Oct-11 23:54 UTC

On 10/11/2014 08:19 AM, Renato Golin wrote:
> On 11 October 2014 02:24, Hayden Livingston <halivingston at gmail.com> wrote:
>> I'm hearing bad things about LLVM's JIT capabilities -- specifically that
>> writing your own GC is going to be a pain.
> While VMKit is retired, there are good examples of how to integrate
> VMs and GCs into LLVM.
>
> http://vmkit.llvm.org/

A better example would be the FTL project which is now part of WebKit.
That's a production ready compiler while VMKit is not.

> I'll let others chime in on the WPO examples. :)
>
> cheers,
> --renato
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Philip Reames

2014-Oct-12 00:15 UTC

On 10/10/2014 06:24 PM, Hayden Livingston wrote:
> Hello,
>
> I was wondering if there is an example list somewhere of whole program
> optimizations done by LLVM based compilers?
>
> I'm only familiar with method-level optimizations, and I'm being told
> wpo can deliver many great speedups.
>
> My language is currently statically typed JIT based and uses the JVM,
> and I want to move it over to LLVM so that I can have options where it
> can be ahead of time compiled as well.

Depending on your use case (and frankly, your budget), you might want to
consider Azul Zing's ReadyNow features:
http://www.azulsystems.com/solutions/zing/readynow

This isn't true ahead of time compilation, but it would be a way to get 
most of the benefits of classic ahead of time compilation running on a 
standards compliant JVM.

(Keep in mind, I work for Azul.  I may be slightly biased here.)

> I'm hearing bad things about LLVM's JIT capabilities -- specifically
> that writing your own GC is going to be a pain.

Out of curiosity, where did you hear this?

We are actively working on improving the state of the world here. I'd 
suggest you take a look at the infrastructure patches currently up for 
review here: http://reviews.llvm.org/D5683

These will hopefully land within a week or two.  At that point, the "gc
infrastructure" part should be functional.  You'd have to pick a GC
(LLVM does not provide one), but your frontend could emit barriers and
statepoints (gc parseable callsites) and everything should work.  (Well,
modulo bugs!  Which I want to know about so we can fix them.)

There are a couple of options out there for pluggable GC libraries. The
best known is Boehm's conservative GC, but there are others.

Once that's in, we're planning on landing all of the late safepoint 
insertion logic we've been working on.  This will enable full 
optimization of code for garbage collected languages - provided you meet 
a few requirements on the input IR.  You can read about it here:
http://www.philipreames.com/Blog/tag/late-safepoint-placement/

And find the (slightly out of date) code here:
https://github.com/AzulSystems/llvm-late-safepoint-placement

> Anyways, sort of diverged there, but still looking for WPO examples!

I'm curious to hear others' take here as well.  A few things that jump
out at me: cross function escape analysis, alias analysis (in support of
things like LICM), and cross function constant propagation.  Not all of
these work out of the box, but with work (sometimes on your side,
sometimes an LLVM patch), interesting results can be had.
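
To make the last item concrete, here is a minimal Python sketch of the idea
behind cross function constant propagation. It is a toy model, not LLVM code:
the call-site table and the names `call_sites`, `VARIABLE` and
`constant_params` are made up for illustration. A parameter can only be
treated as a constant inside the callee when every call site in the program
passes the same constant, which is why the analysis wants to see the whole
program.

```python
# Toy model of cross-function (interprocedural) constant propagation.
# Not LLVM code: the call-site table and all names are hypothetical.

VARIABLE = object()  # marker for "argument is not a compile-time constant"

# Every call site in the "whole program": (callee name, argument values).
call_sites = [
    ("scale", (2, VARIABLE)),
    ("scale", (2, VARIABLE)),
    ("clamp", (0, 255, VARIABLE)),
    ("clamp", (0, 100, VARIABLE)),
]


def constant_params(sites):
    """Map each callee to {param index: constant} for parameters that
    receive the same constant at every call site."""
    seen = {}
    for callee, args in sites:
        if callee not in seen:
            # First call site: every constant argument is a candidate.
            seen[callee] = {
                i: a for i, a in enumerate(args) if a is not VARIABLE
            }
        else:
            # Later call sites: keep only candidates that still agree.
            seen[callee] = {
                i: c
                for i, c in seen[callee].items()
                if i < len(args) and args[i] is not VARIABLE and args[i] == c
            }
    return seen


result = constant_params(call_sites)
print(result)
```

In this table the first parameter of `scale` is always 2 and the first
parameter of `clamp` is always 0, so both could be folded into the callee;
the second parameter of `clamp` differs between call sites and could not.
If only one of the call sites were visible, none of this could be concluded
safely.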

Fair warning, while getting an LLVM based JIT up and running at peak 
performance is a worthwhile endeavor (IMHO), it's also a fair amount of 
work.  Getting something functional is relatively straightforward, but
there's a lot of non-trivial tuning of your generated IR to really
exploit the power of the optimizers well.   We're talking person-years
of work here.  Most of this is in the performance tuning phase, and 
depending on your point of comparison, it may be an easier or harder 
problem.  Essentially, the closer to C performance your current runtime 
is, the harder you'll have to work.  Getting 1/10 of C performance with 
an untuned LLVM based JIT is pretty easy; the closer you get to C (or 
JVM) performance the harder it gets.

(Disclaimer: This is me speaking off the top of my head.  Take 
everything I just said with a grain of salt.)

Philip

Hayden Livingston

2014-Oct-12 02:20 UTC

Thanks, Philip, for the "lay of the ground" picture. I think the situation
I'm in, which represents my employment (and now personal technical
curiosity), is that we're seeing LLVM implementations show up every
other week or month, and people are asking us, "Well, this mathematical
software of yours is great, but my engineer here tells me it's not using
this LLVM thing, and I think we're wasting cloud compute resources by
using the JVM technology" -- this is how non-tech people are talking to me
about this :-)

I heard about the LLVM JIT situation from a bunch of my friends, one of whom
was part of the Unladen Swallow effort and basically said -- "Trust me, it's
not going to work, I put 2 years of my life every single day into it".

But honestly, I personally am not familiar with writing a GC or what it
necessarily entails -- I want to, and I can pick it up, but I have spent most
of my time writing JVM-based tooling, profilers, byte code cachers, etc.

With regards to ReadyNow, I think at least someone on my team was looking
at it.

In any case, I'll be following your blog closely now!


Filip Pizlo

2014-Oct-12 06:37 UTC

> On Oct 10, 2014, at 6:24 PM, Hayden Livingston <halivingston at gmail.com> wrote:
>
> Hello,
>
> I was wondering if there is an example list somewhere of whole program
> optimizations done by LLVM based compilers?
>
> I'm only familiar with method-level optimizations, and I'm being told
> wpo can deliver many great speedups.
>
> My language is currently statically typed JIT based and uses the JVM, and I
> want to move it over to LLVM so that I can have options where it can be
> ahead of time compiled as well.

As Philip kindly pointed out, WebKit uses llvm as part of a JavaScript JIT
optimization pipeline. It works well for WebKit, but this was a large amount of
work. It may not be the path of least resistance depending on what your
requirements are.

> I'm hearing bad things about LLVM's JIT capabilities -- specifically that
> writing your own GC is going to be a pain.

This is a fun topic and you'll probably get some good advice. :-)

Here's my take. GC in llvm is only a pain if you make the tragic mistake of
writing an accurate-on-the-stack GC. Accurate collectors are only known to be
beneficial in niche environments, usually if you have an aversion to
probabilistic algorithms. You might also be stuck requiring accuracy if your
system relies on being able to force *every* object to *immediately* move to a
new location, but this is an uncommon requirement - usually it happens due to
certain speculative optimization strategies in dynamic languages.

My approach is to use a Bartlett-style mostly-copying collector. If you use a
Bartlett-style collector then you don't need any special support in llvm. It
just works, it allows llvm to register-allocate pointers at will, and it lends
itself naturally to high-throughput collector algorithms. Bartlett-style
collectors come in many shapes and sizes - copying or not, mark-region or not,
generational or not, and even a fancy concurrent copying example exists.
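
For readers who have not seen the technique, the following is a toy Python
model of one mostly-copying collection, under simplifying assumptions that
are not from the thread: addresses are small integers, a page holds
`PAGE_SIZE` object slots, and the names `Obj` and `collect` are hypothetical.
Stack words are treated as ambiguous, so any word that points into a heap
page pins that page in place. Heap fields are precise, so everything
reachable from the pinned objects can be copied and the pointers updated.

```python
# Toy model of a Bartlett-style mostly-copying collection.
# Hypothetical layout: integer addresses, PAGE_SIZE slots per page.

PAGE_SIZE = 4


class Obj:
    def __init__(self, name, fields=()):
        self.name = name
        self.fields = list(fields)  # precise heap pointers (addresses)


def collect(heap, ambiguous_roots, first_free_page):
    """Run one collection; returns the new heap as {address: Obj}.
    Objects are updated in place to point at the new addresses."""
    pages_in_use = {addr // PAGE_SIZE for addr in heap}
    # 1. A root word that points into a heap page pins the whole page,
    #    whether or not the word is really a pointer.
    pinned = {
        word // PAGE_SIZE
        for word in ambiguous_roots
        if word // PAGE_SIZE in pages_in_use
    }
    new_heap, forward, worklist = {}, {}, []
    # 2. Everything on a pinned page survives without moving.
    for addr, obj in heap.items():
        if addr // PAGE_SIZE in pinned:
            new_heap[addr] = obj
            forward[addr] = addr
            worklist.append(obj)
    # 3. Objects reachable through precise heap fields are copied to
    #    fresh pages, and the fields are rewritten to the new addresses.
    next_addr = first_free_page * PAGE_SIZE
    while worklist:
        obj = worklist.pop()
        for i, target in enumerate(obj.fields):
            if target not in forward:
                forward[target] = next_addr
                new_heap[next_addr] = heap[target]
                worklist.append(heap[target])
                next_addr += 1
            obj.fields[i] = forward[target]
    return new_heap


heap = {
    0: Obj("a", [5]),  # page 0, referenced from the stack
    1: Obj("b"),       # page 0, garbage, but shares a pinned page
    4: Obj("c"),       # page 1, unreachable
    5: Obj("d", [8]),  # page 1, reachable from "a"
    8: Obj("e"),       # page 2, reachable from "d"
}
roots = [0, 12345]     # 12345 is an integer that is not a heap address
new_heap = collect(heap, roots, first_free_page=3)
print({addr: obj.name for addr, obj in new_heap.items()})
```

In this run "a" stays at its old address because the stack may point to it,
"d" and "e" are moved to the fresh page, and "c" is reclaimed. "b" is
retained only because it shares a page with "a", which is the kind of extra
retention a mostly-copying design accepts in exchange for not needing exact
stack information from the compiler.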

WebKit used a Bartlett-style parallel generational sticky-mark copying collector
with opportunistic mark-region optimizations. We haven't written up anything
about it yet but it is all open source.

Hosking's paper about the concurrent variant is here:
http://dl.acm.org/citation.cfm?doid=1133956.1133963

I highly recommend reading Bartlett's original paper about conservative
copying; it provides an excellent semi space algorithm that would be a
respectable starting point for any VM. You won't regret implementing it -
it'll simplify your interface to any JIT, not just llvm. It'll also make
FFI easy because it allows the C stack to refer directly to GC objects without
any shenanigans.

Bartlett is probabilistic in the sense that it may, with low probability,
increase object drag. This happens rarely. On 64-bit systems it's especially
rare. It's been pretty well demonstrated that Bartlett collectors are as
fast as accurate ones, insofar as anything in GC land can be demonstrated (as in
it's still a topic of lively debate, though I had some papers back in the
day that showed some comparisons). WebKit often wins GC benchmarks for example,
and we particularly like that our GC never imposes limitations on llvm
optimizations. It's really great to be able to view the compiler and the
collector as orthogonal components!
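
A back-of-the-envelope calculation shows why pointer width matters here. It
rests on a simplifying assumption made only for this illustration: a
non-pointer word on the stack is a uniformly random bit pattern. Real
programs are not like that (small integers are far more common than large
ones), so the numbers are indicative only. The function name and the 1 GiB
heap size are hypothetical.

```python
def false_pointer_probability(heap_bytes, address_bits):
    """Chance that one uniformly random word of `address_bits` bits
    happens to fall inside a heap of `heap_bytes` bytes."""
    return heap_bytes / 2 ** address_bits


heap_bytes = 1 << 30  # hypothetical 1 GiB heap
p32 = false_pointer_probability(heap_bytes, 32)
p64 = false_pointer_probability(heap_bytes, 64)
print(p32)
print(p64)
```

Under that assumption a random word hits a 1 GiB heap one time in four on a
32-bit system, but only about once in 17 billion tries on a 64-bit system.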

> Anyways, sort of diverged there, but still looking for WPO examples!
>
> Hayden.

Philip Reames

2014-Oct-13 21:50 UTC

On 10/11/2014 11:37 PM, Filip Pizlo wrote:
> Here's my take. GC in llvm is only a pain if you make the tragic
> mistake of writing an accurate-on-the-stack GC. Accurate collectors are
> only known to be beneficial in niche environments, usually if you have an
> aversion to probabilistic algorithms. You might also be stuck requiring
> accuracy if your system relies on being able to force *every* object to
> *immediately* move to a new location, but this is an uncommon requirement -
> usually it happens due to certain speculative optimization strategies in
> dynamic languages.

I disagree with Filip in the details here.  I don't believe that precise
collectors are always better - they're not! - but the set of
circumstances they're helpful for is larger than Filip admits.

However, we've had that discussion on the mailing list before, and it's
far enough off topic to not be worth repeating in detail. :)

Kevin Modzelewski

2014-Oct-13 22:23 UTC

With the patchpoint infrastructure, shouldn't it now be relatively
straightforward to do an accurate-but-non-relocatable scan of the stack, by
attaching all the GC roots as stackmap arguments to patchpoints?  This is
something we're currently working on for Pyston (i.e. we don't have it
working yet), but I think we might get it "for free" once we finish the
work on frame introspection.
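
As a rough sketch of what such a scan amounts to, here is a toy Python model.
It does not use LLVM's actual stackmap section format or the patchpoint
intrinsic; the table layout and the names `stack_maps`, `scan_roots` and
`mark` are assumptions made for illustration. The compiler records, per call
site, which frame slots hold GC pointers, and the collector reads exactly
those slots and marks from them without moving anything.

```python
# Toy model of an accurate, non-relocating stack scan driven by stack maps.
# Hypothetical layout: this is not LLVM's real stackmap format.

# Emitted by the compiler: for each call site (keyed by return address),
# the frame slots that hold live GC pointers at that point.
stack_maps = {
    0x1000: [0, 2],
    0x2000: [1],
}

# A captured stack: one (return address, frame slots) pair per frame.
stack = [
    (0x1000, ["obj_a", 42, "obj_b"]),  # slot 1 is a plain integer
    (0x2000, [7, "obj_c"]),            # slot 0 is a plain integer
]

# Heap graph: object -> objects it references.
heap = {
    "obj_a": ["obj_d"],
    "obj_b": [],
    "obj_c": [],
    "obj_d": [],
    "obj_e": [],
}


def scan_roots(frames, maps):
    """Collect exactly the slots the stack maps mark as GC pointers."""
    found = []
    for return_address, slots in frames:
        for index in maps[return_address]:
            found.append(slots[index])
    return found


def mark(root_set, graph):
    """Mark everything reachable from the roots; nothing is moved."""
    live, worklist = set(), list(root_set)
    while worklist:
        obj = worklist.pop()
        if obj not in live:
            live.add(obj)
            worklist.extend(graph[obj])
    return live


roots = scan_roots(stack, stack_maps)
live = mark(roots, heap)
print(roots, sorted(live))
```

The integer slots are never mistaken for pointers because the maps say which
slots to read, and `obj_e` is the only object left unmarked.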

I'm not aware of any high-performance conservative GC implementations that
are designed to be pluggable (if there are, please let us know!) -- they
typically seem pretty integrated with the VM's object model and language
features that need to be supported.  We're spending some time right now to
improve our GC situation, which is "a pain" since it's more-or-less
reinventing the wheel.  It's not made any harder by LLVM, but it's tough in
the sense that we're not getting it for free like we would if we were on
something like the JVM.

