thr3ads.net - llvm dev - [LLVMdev] whole program optimization examples? [Oct 2014]

Kevin Modzelewski

2014-Oct-13 22:23 UTC

[LLVMdev] whole program optimization examples?

With the patchpoint infrastructure, shouldn't it now be relatively
straightforward to do an accurate-but-non-relocatable scan of the stack, by
attaching all the GC roots as stackmap arguments to patchpoints?  This is
something we're currently working on for Pyston (i.e. we don't have it
working yet), but I think we might get it "for free" once we finish the
work on frame introspection.
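
As a rough illustration of what such a scan does, here is a minimal Python model. It is not LLVM's stackmap format or API: the `STACK_MAPS` table, the `Frame` class and the slot-index encoding are hypothetical stand-ins for the per-call-site records a stackmap would provide. The point it tries to show is that the collector reads roots only from recorded locations and never rewrites them, so objects are marked in place rather than moved.

```python
from dataclasses import dataclass

# Hypothetical stand-in for stackmap records: for each call site (keyed by
# return address), the frame slots that hold GC roots at that point.
STACK_MAPS = {
    0x1000: [0, 2],   # at this call site, slots 0 and 2 are roots
    0x2000: [1],      # at this call site, only slot 1 is a root
}

@dataclass
class Frame:
    return_address: int
    words: list       # raw frame contents; only some words are object ids

def scan_stack(frames, stack_maps):
    """Return the exact root list: only slots named by the stack map."""
    roots = []
    for frame in frames:
        for slot in stack_maps.get(frame.return_address, []):
            roots.append(frame.words[slot])
    return roots

def mark(heap, roots):
    """Mark reachable objects in place; nothing is moved or rewritten."""
    marked, worklist = set(), list(roots)
    while worklist:
        obj = worklist.pop()
        if obj in marked:
            continue
        marked.add(obj)
        worklist.extend(heap[obj])
    return marked

# Toy heap: object id -> ids of the objects it references.
heap = {1: [2], 2: [], 3: [], 4: [3]}
frames = [
    Frame(0x1000, [1, 4, 1]),   # word 1 is the plain integer 4, not a root
    Frame(0x2000, [3, 2]),      # word 0 is the plain integer 3, not a root
]
roots = scan_stack(frames, STACK_MAPS)
live = mark(heap, roots)
```

In this toy run the integers 4 and 3 sitting in non-root slots are ignored, so objects 3 and 4 are not kept alive; a conservative scan of the same frames would have retained them.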

I'm not aware of any high-performance conservative GC implementations that
are designed to be pluggable (if there are, please let us know!) -- they
typically seem pretty integrated with the VM's object model and the language
features that need to be supported.  We're spending some time right now to
improve our GC situation, which is "a pain" since it's more-or-less
reinventing the wheel.  It's not made any harder by LLVM, but it's tough in
the sense that we're not getting it for free like we would if we were on
something like the JVM.

On Sat, Oct 11, 2014 at 11:37 PM, Filip Pizlo <fpizlo at apple.com> wrote:
>
>
> > On Oct 10, 2014, at 6:24 PM, Hayden Livingston <halivingston at gmail.com>
> > wrote:
> >
> > Hello,
> >
> > I was wondering if there is an example list somewhere of whole program
> > optimizations done by LLVM based compilers?
> >
> > I'm only familiar with method-level optimizations, and I'm being told
> > WPO can deliver many great speedups.
> >
> > My language is currently statically typed, JIT based, and uses the JVM,
> > and I want to move it over to LLVM so that I can have options where it
> > can be ahead-of-time compiled as well.
>
> As Philip kindly pointed out, WebKit uses llvm as part of a JavaScript JIT
> optimization pipeline. It works well for WebKit, but this was a large
> amount of work. It may not be the path of least resistance depending on
> what your requirements are.
>
> >
> > I'm hearing bad things about LLVM's JIT capabilities -- specifically
> > that writing your own GC is going to be a pain.
>
> This is a fun topic and you'll probably get some good advice. :-)
>
> Here's my take. GC in llvm is only a pain if you make the tragic mistake
> of writing an accurate-on-the-stack GC. Accurate collectors are only known
> to be beneficial in niche environments, usually if you have an aversion to
> probabilistic algorithms. You might also be stuck requiring accuracy if
> your system relies on being able to force *every* object to *immediately*
> move to a new location, but this is an uncommon requirement - usually it
> happens due to certain speculative optimization strategies in dynamic
> languages.
>
> My approach is to use a Bartlett-style mostly-copying collector. If you
> use a Bartlett-style collector then you don't need any special support in
> llvm. It just works, it allows llvm to register-allocate pointers at will,
> and it lends itself naturally to high-throughput collector algorithms.
> Bartlett-style collectors come in many shapes and sizes - copying or not,
> mark-region or not, generational or not, and even a fancy concurrent
> copying example exists.
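
For readers who have not seen the technique, the following is a toy Python model of one mostly-copying collection. It is a sketch under heavy assumptions, not WebKit's collector and not Bartlett's exact algorithm: `PAGE_SIZE`, `page_of` and `collect` are hypothetical names, every object occupies a single address, and every object on a pinned page is retained. What it tries to show is the core idea: stack words are treated as ambiguous and only pin pages, while heap references are exact and can be rewritten when objects are copied.

```python
PAGE_SIZE = 16  # hypothetical page size, in address units

def page_of(addr):
    return addr // PAGE_SIZE

def collect(heap, stack_words):
    """Run one mostly-copying collection over a toy heap.

    heap: dict mapping address -> list of fields (addresses or None).
    stack_words: raw integers from the stack; each may or may not be a pointer.
    Returns the new heap. Objects on pinned pages keep their addresses.
    """
    used_pages = {page_of(a) for a in heap}
    # 1. Ambiguous roots pin whole pages; the stack itself is never rewritten.
    pinned = {page_of(w) for w in stack_words if page_of(w) in used_pages}

    new_heap = {}
    forward = {}  # old address -> new address, for copied objects
    # Fresh "to-space" addresses start after the highest used page.
    next_free = (max(used_pages, default=-1) + 1) * PAGE_SIZE

    # 2. Every object on a pinned page is retained in place.
    worklist = []
    for addr, fields in heap.items():
        if page_of(addr) in pinned:
            new_heap[addr] = list(fields)
            worklist.append(addr)

    def relocate(addr):
        nonlocal next_free
        if addr is None or page_of(addr) in pinned:
            return addr            # null, or a pinned object that stays put
        if addr not in forward:    # first visit: copy the object
            forward[addr] = next_free
            new_heap[next_free] = list(heap[addr])
            worklist.append(next_free)
            next_free += 1
        return forward[addr]

    # 3. Scan retained objects; heap references are exact, so update them.
    while worklist:
        obj = worklist.pop()
        new_heap[obj] = [relocate(f) for f in new_heap[obj]]
    return new_heap

heap = {
    16: [33],  # page 1, referenced from the stack
    17: [],    # page 1, garbage that happens to share the pinned page
    33: [49],  # page 2, reachable from 16
    49: [],    # page 3, reachable from 33
    56: [],    # page 3, unreachable
}
stack_words = [16, 123456]  # 123456 does not fall on any heap page
after = collect(heap, stack_words)
```

In this run objects 33 and 49 are copied to fresh addresses (64 and 65) and the reference chain is updated, object 56 is reclaimed, and object 17 survives only because it shares a page with a pinned object.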
>
> WebKit used a Bartlett-style parallel generational sticky-mark copying
> collector with opportunistic mark-region optimizations. We haven't written
> up anything about it yet but it is all open source.
>
> Hosking's paper about the concurrent variant is here:
> http://dl.acm.org/citation.cfm?doid=1133956.1133963
>
> I highly recommend reading Bartlett's original paper about conservative
> copying; it provides an excellent semi space algorithm that would be a
> respectable starting point for any VM. You won't regret implementing it -
> it'll simplify your interface to any JIT, not just llvm. It'll also make
> FFI easy because it allows the C stack to refer directly to GC objects
> without any shenanigans.
>
> Bartlett is probabilistic in the sense that it may, with low probability,
> increase object drag. This happens rarely. On 64-bit systems it's
> especially rare. It's been pretty well demonstrated that Bartlett
> collectors are as fast as accurate ones, insofar as anything in GC land can
> be demonstrated (as in it's still a topic of lively debate, though I had
> some papers back in the day that showed some comparisons). WebKit often
> wins GC benchmarks for example, and we particularly like that our GC never
> imposes limitations on llvm optimizations. It's really great to be able to
> view the compiler and the collector as orthogonal components!
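
A back-of-the-envelope calculation gives some intuition for why address-space size matters here. It assumes, purely for illustration, that a non-pointer stack word is uniformly random over the address space; real integers are not distributed that way, so the numbers are an order-of-magnitude intuition at best. The 1 GiB heap and the 47-bit user address space are likewise assumed values, and `false_pointer_probability` is a hypothetical helper.

```python
def false_pointer_probability(heap_bytes, address_bits):
    """Chance that one uniformly random word lands inside the heap."""
    return heap_bytes / 2 ** address_bits

HEAP = 1 << 30  # assumed 1 GiB heap

p32 = false_pointer_probability(HEAP, 32)  # 32-bit address space
p64 = false_pointer_probability(HEAP, 47)  # assumed 47-bit user address space

print(f"32-bit: {p32:.3f}")  # prints "32-bit: 0.250"
print(f"64-bit: {p64:.2e}")  # prints "64-bit: 7.63e-06"
```

Under these assumptions a stray word is tens of thousands of times less likely to look like a heap pointer in the larger address space.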
>
> >
> > Anyways, sort of diverged there, but still looking for WPO examples!
> >
> > Hayden.
> > _______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Filip Pizlo

2014-Oct-13 22:42 UTC

[LLVMdev] whole program optimization examples?

> On Oct 13, 2014, at 3:23 PM, Kevin Modzelewski <kmod at dropbox.com> wrote:
>
> With the patchpoint infrastructure, shouldn't it now be relatively
> straightforward to do an accurate-but-non-relocatable scan of the stack, by
> attaching all the GC roots as stackmap arguments to patchpoints?

Yes.

> This is something we're currently working on for Pyston (i.e. we don't have
> it working yet), but I think we might get it "for free" once we finish the
> work on frame introspection.
>
> I'm not aware of any high-performance conservative GC implementations that
> are designed to be pluggable (if there are please let us know!) -- they
> typically seem pretty integrated with the VM's object model and language
> features that need to be supported.

If you want a high-performance GC, you will end up integrating it with the VM’s
object model.

> We're spending some time right now to improve our GC situation, which is
> "a pain" since it's more-or-less reinventing the wheel.  It's not made any
> harder by LLVM, but it's tough in the sense that we're not getting it for
> free like we would if we were on something like the JVM.

Philip Reames

2014-Oct-13 23:07 UTC

[LLVMdev] whole program optimization examples?

On 10/13/2014 03:23 PM, Kevin Modzelewski wrote:
> With the patchpoint infrastructure, shouldn't it now be relatively
> straightforward to do an accurate-but-non-relocatable scan of the
> stack, by attaching all the GC roots as stackmap arguments to
> patchpoints?  This is something we're currently working on for Pyston
> (i.e. we don't have it working yet), but I think we might get it "for
> free" once we finish the work on frame introspection.

Take a look at the statepoint intrinsics up for review.  These are
essentially exactly that, with two extensions:
- A semantic distinction between GC roots and deopt state (since you may
want both)
- Support for explicit relocation of the GC root values (this could be
made optional, but is currently not)

Though, you really don't want to emit these in your frontend.  You can, 
it'll work, but the performance will suffer.  Doing so will prevent many 
useful optimizations from running.  Instead, you probably want to 
consider something like the late safepoint placement approach we've been 
pushing.  Hopefully, once the statepoint stuff lands, we can get that 
upstreamed fairly soon.
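
To make the two extensions a little more concrete, here is a small Python model. It does not use the statepoint intrinsics themselves; `Safepoint`, `gc_roots`, `deopt_state` and `relocate_roots` are hypothetical names chosen for illustration, and in this toy the deopt state holds only non-pointer values. It shows a safepoint record that keeps GC roots separate from deopt state, and a collector that explicitly hands back relocated root values after moving an object.

```python
from dataclasses import dataclass

@dataclass
class Safepoint:
    gc_roots: list     # addresses the collector must trace and may relocate
    deopt_state: dict  # values needed to rebuild an interpreter frame

def relocate_roots(safepoint, forwarding):
    """Return the post-collection root values.

    forwarding maps old addresses to new ones for objects that moved.
    Only GC roots are rewritten; deopt state is left untouched.
    """
    return [forwarding.get(root, root) for root in safepoint.gc_roots]

sp = Safepoint(
    gc_roots=[0x100, 0x200],
    deopt_state={"bytecode_index": 42, "local_0": 7},
)
moved = {0x100: 0x900}                 # the collector moved one object
new_roots = relocate_roots(sp, moved)  # code after the safepoint uses these
```

The explicit return value stands in for the idea that code after the safepoint must use the relocated values rather than the originals.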

Philip

Kevin Modzelewski

2014-Oct-14 00:30 UTC

[LLVMdev] whole program optimization examples?

On Mon, Oct 13, 2014 at 3:42 PM, Filip Pizlo <fpizlo at apple.com> wrote:
> > This is something we're currently working on for Pyston (ie we don't have
> > it working yet), but I think we might get it "for free" once we finish the
> > work on frame introspection.
> >
> > I'm not aware of any high-performance conservative GC implementations that
> > are designed to be pluggable (if there are please let us know!) -- they
> > typically seem pretty integrated with the VMs object model and language
> > features that need to be supported.
>
> If you want a high-performance GC, you will end up integrating it with the
> VM’s object model.

Sorry, wasn't trying to knock any of the non-pluggable GCs out there, was
just trying to attest to the OP's worry that "writing your own GC is going
to be a pain" -- even though this is a problem that's been solved
repeatedly and LLVM doesn't necessarily get in your way, there's still a
lot of work to get a good GC (let alone a whole VM) running on top of LLVM.

Filip Pizlo

2014-Oct-14 01:17 UTC

[LLVMdev] whole program optimization examples?

> On Oct 13, 2014, at 4:07 PM, Philip Reames <listmail at philipreames.com> wrote:
>
>> On 10/13/2014 03:23 PM, Kevin Modzelewski wrote:
>> With the patchpoint infrastructure, shouldn't it now be relatively
>> straightforward to do an accurate-but-non-relocatable scan of the stack, by
>> attaching all the GC roots as stackmap arguments to patchpoints?  This is
>> something we're currently working on for Pyston (ie we don't have it
>> working yet), but I think we might get it "for free" once we finish
>> the work on frame introspection.
>
> Take a look at the statepoint intrinsics up for review.  These are
> essentially exactly that, with two extensions:
> - A semantic distinction between gc roots and deopt state (since you may
>   want both)
> - Support for explicit relocation of the gc root values (this could be made
>   optional, but is currently not)
>
> Though, you really don't want to emit these in your frontend.  You can,
> it'll work, but the performance will suffer.  Doing so will prevent many
> useful optimizations from running.

You really should be specific here. The optimizations you're thinking of may
be uninteresting to many clients.

Also you won't lose any performance if your GC pointers are also needed for
deopt (which happens to be the common case).

I really do think that this whole discussion is tragicomic. Most clients of LLVM
would be best served with mostly copying GC.

-Filip

> Instead, you probably want to consider something like the late safepoint
> placement approach we've been pushing.  Hopefully, once the statepoint stuff
> lands, we can get that upstreamed fairly soon.
>
> Philip
