On Sun, Aug 8, 2010 at 8:20 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
>
> On Aug 7, 2010, at 7:05 PM, Steven Noonan wrote:
>> I've been doing work on memory reduction in Unladen Swallow, and
>> during testing, LiveRanges seemed to be consuming one of the largest
>> chunks of memory.
>
> That's interesting. How did you measure this? I'd love to see your data.
>
> Note that the LiveRange struct is allocated by a plain std::vector, and
> your patch doesn't change that. I assume you are talking about the VNInfo
> structs?

Steven has been using Instruments, and sending us screenshots. Does
anyone else know a better way of exporting that data?

I thought I dug into the register allocation code and found the
VNInfo::Allocator typedef. I assumed that was getting the traffic we
saw in Instruments, but I don't have the data to back that up.

>> I wrote a replacement allocator for use by
>> BumpPtrAllocator which uses mmap()/munmap() in place of
>> malloc()/free().
>
> It's a bit more complicated than that. Modern mallocs use a whole bag of
> tricks to avoid lock contention on multiprocessor systems, and they know
> which allocation sizes the kernel likes, and which system calls to use.
>
> By calling mmap directly, you are throwing all that system-specific
> knowledge away.

So the goal of this particular modification was to find ways to return
large, one-time allocations that happen during compilation back to the
OS. For unladen-swallow, we have a long-lived Python process where we
JIT code every so often. We happen to generate an ungodly amount of
code, which we're trying to reduce. However, this means that LLVM
allocates a lot of memory for us, and it grows our heap by several MB
over what it would normally be. The breakdown was roughly 8 MB
allocated for this one compilation in the spam_bayes benchmark, with
2 MB coming from register allocation and 2 MB from SDNodes.

We are looking at using mmap/munmap to avoid permanently growing the heap.

This patch switches all allocators over to mmap, so you can see a lot
of "stitches" in the graphs, where an allocator is created and thrown
away quickly. Those allocations are probably better served by malloc.

> It's great that you provide measurements, but it's not clear what you are
> measuring. Does 'mem max' include the overhead of asking the kernel for
> tiny 4K allocations, if any? Also, what is your operating system and
> architecture? That could make a big difference.

The memory size measurements in this data were all taken from
/proc/smaps on Linux, counting the number of dirty pages.

> Have you looked at the effect of twiddling the default 4K slab size in
> BumpPtrAllocator? I suspect you could get more dramatic results that way.

If we did that, one thing that might happen is that malloc might start
forwarding to mmap, but I think you have to allocate ~128K at a time
to hit that threshold.

Reid
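For readers who want to see what a direct-mmap slab allocator amounts to, here
is a minimal sketch; the class name and interface are illustrative, not the
actual Unladen Swallow patch or LLVM's slab allocator interface:

    #include <sys/mman.h>
    #include <cstddef>

    // Illustrative only: hand whole slabs to the kernel with mmap() and
    // return them with munmap(), so freed pages leave the process
    // immediately instead of lingering on the malloc heap.
    // (MAP_ANONYMOUS is spelled MAP_ANON on some systems.)
    class MmapSlabAllocator {
    public:
      void *Allocate(std::size_t Size) {
        void *P = ::mmap(0, Size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (P == MAP_FAILED)
          return 0;
        return P;
      }
      void Deallocate(void *Ptr, std::size_t Size) {
        if (Ptr)
          ::munmap(Ptr, Size);  // pages go straight back to the OS
      }
    };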
On Aug 8, 2010, at 9:20 PM, Reid Kleckner wrote:

> I thought I dug into the register allocation code and found the
> VNInfo::Allocator typedef. I assumed that was getting the traffic we
> saw in Instruments, but I don't have the data to back that up.

Are you using llvm from trunk? VNInfo is a lot smaller now than it was
in 2.7. I would guess about a third of the liveness memory usage goes
through the VNInfo BumpPtrAllocator.

[...]

>> By calling mmap directly, you are throwing all that system-specific
>> knowledge away.
>
> So the goal of this particular modification was to find ways to return
> large, one-time allocations that happen during compilation back to the
> OS. For unladen-swallow, we have a long-lived Python process where we
> JIT code every so often. We happen to generate an ungodly amount of
> code, which we're trying to reduce. However, this means that LLVM
> allocates a lot of memory for us, and it grows our heap by several MB
> over what it would normally be. The breakdown was roughly 8 MB
> allocated for this one compilation in the spam_bayes benchmark, with
> 2 MB coming from register allocation and 2 MB from SDNodes.
>
> We are looking at using mmap/munmap to avoid permanently growing the heap.

Don't try to outsmart malloc, you are going to lose ;-)

This all depends on specific kernel implementation details, but you
risk badly fragmenting your address space, and chances are the kernel
is not going to handle that well. You are using the kernel as a memory
manager, but the kernel wants to be used as a dumb slab allocator for
malloc.

I assume that LLVM is properly freeing memory after jitting?
Otherwise, that should be looked at.

So why isn't your malloc returning the memory to the OS?

Is it because malloc thinks you might be needing that memory soon
anyway? Is it correct?

Does your malloc know that you are running with very little memory,
and that the system badly needs those 8MB? Maybe your malloc needs to
be tuned for a small device?

Is LLVM leaving a fragmented heap behind? Why? That would be worth
looking into.

/jakob
On Mon, 9 Aug 2010 09:36:53 -0700, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
>
> On Aug 8, 2010, at 9:20 PM, Reid Kleckner wrote:
>
> > I thought I dug into the register allocation code and found the
> > VNInfo::Allocator typedef. I assumed that was getting the traffic
> > we saw in Instruments, but I don't have the data to back that up.
>
> Are you using llvm from trunk? VNInfo is a lot smaller now than it
> was in 2.7. I would guess about a third of the liveness memory usage
> goes through the VNInfo BumpPtrAllocator.
>
> [...]
>
> >> By calling mmap directly, you are throwing all that system-specific
> >> knowledge away.
> >
> > So the goal of this particular modification was to find ways to
> > return large, one-time allocations that happen during compilation
> > back to the OS. For unladen-swallow, we have a long-lived Python
> > process where we JIT code every so often. We happen to generate an
> > ungodly amount of code, which we're trying to reduce. However,
> > this means that LLVM allocates a lot of memory for us, and it grows
> > our heap by several MB over what it would normally be. The
> > breakdown was roughly 8 MB allocated for this one compilation
> > in the spam_bayes benchmark, with 2 MB coming from register
> > allocation and 2 MB from SDNodes.
> >
> > We are looking at using mmap/munmap to avoid permanently growing
> > the heap.
>
> Don't try to outsmart malloc, you are going to lose ;-)
>
> This all depends on specific kernel implementation details, but you
> risk badly fragmenting your address space, and chances are the kernel
> is not going to handle that well. You are using the kernel as a
> memory manager, but the kernel wants to be used as a dumb slab
> allocator for malloc.
>
> I assume that LLVM is properly freeing memory after jitting?
> Otherwise, that should be looked at.
>
> So why isn't your malloc returning the memory to the OS?
>
> Is it because malloc thinks you might be needing that memory soon
> anyway? Is it correct?
>
> Does your malloc know that you are running with very little memory,
> and that the system badly needs those 8MB? Maybe your malloc needs
> to be tuned for a small device?
>
> Is LLVM leaving a fragmented heap behind?

With mmap() it is always possible to fully release the memory once you
are done using it. With malloc(), no: it takes just one allocation at
the end of the heap to keep all the rest allocated. That wouldn't be a
problem if libc used mmap() as the low-level allocator for malloc, but
it doesn't. It mostly uses sbrk() for small (<128k) allocations, and
even with mmaps it caches them for a while. I think that is because
mmap() is slow in multithreaded apps, since it needs to take a
process-level lock, which also contends with the lock taken by page
faults from other existing mmaps (in fact that lock is held during
disk I/O!).

Best regards,
--Edwin
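As a glibc-specific aside, the mmap threshold Edwin mentions is tunable, and
the heap can also be trimmed explicitly; a minimal sketch (the 64 KiB value
is purely illustrative):

    #include <malloc.h>  // glibc-specific: mallopt(), malloc_trim()

    int main() {
      // Ask glibc to service allocations above 64 KiB with mmap() so that
      // free() can unmap them instead of leaving them on the sbrk() heap.
      mallopt(M_MMAP_THRESHOLD, 64 * 1024);

      // ... JIT / compile here ...

      // Return any free pages at the top of the heap to the kernel.
      malloc_trim(0);
      return 0;
    }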
On Sun, Aug 8, 2010 at 9:20 PM, Reid Kleckner <reid.kleckner at gmail.com> wrote:

> On Sun, Aug 8, 2010 at 8:20 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk>
> wrote:
> >
> > On Aug 7, 2010, at 7:05 PM, Steven Noonan wrote:
> >> I've been doing work on memory reduction in Unladen Swallow, and
> >> during testing, LiveRanges seemed to be consuming one of the largest
> >> chunks of memory.
> >
> > That's interesting. How did you measure this? I'd love to see your data.
> >
> > Note that the LiveRange struct is allocated by a plain std::vector, and
> > your patch doesn't change that. I assume you are talking about the VNInfo
> > structs?
>
> Steven has been using Instruments, and sending us screenshots. Does
> anyone else know a better way of exporting that data?

So, just so you're aware, direct calls to mmap are not intercepted and
reported by Instruments. Using mmap instead of malloc will make your
_reported_ numbers go down, but that doesn't necessarily mean you have
better performance.

This is a problem for people doing performance measurements on Mac OS X
and iOS, because exotic memory allocation schemes seem to be becoming
more common (I hope not because they dodge reporting!). In particular,
many image buffers are allocated directly from mmap and vm_allocate,
within CoreGraphics and elsewhere.

-Ken
Cocoa Frameworks

> I thought I dug into the register allocation code and found the
> VNInfo::Allocator typedef. I assumed that was getting the traffic we
> saw in Instruments, but I don't have the data to back that up.
>
> >> I wrote a replacement allocator for use by
> >> BumpPtrAllocator which uses mmap()/munmap() in place of
> >> malloc()/free().
> >
> > It's a bit more complicated than that. Modern mallocs use a whole bag of
> > tricks to avoid lock contention on multiprocessor systems, and they know
> > which allocation sizes the kernel likes, and which system calls to use.
> >
> > By calling mmap directly, you are throwing all that system-specific
> > knowledge away.
>
> So the goal of this particular modification was to find ways to return
> large, one-time allocations that happen during compilation back to the
> OS. For unladen-swallow, we have a long-lived Python process where we
> JIT code every so often. We happen to generate an ungodly amount of
> code, which we're trying to reduce. However, this means that LLVM
> allocates a lot of memory for us, and it grows our heap by several MB
> over what it would normally be. The breakdown was roughly 8 MB
> allocated for this one compilation in the spam_bayes benchmark, with
> 2 MB coming from register allocation and 2 MB from SDNodes.
>
> We are looking at using mmap/munmap to avoid permanently growing the heap.
>
> This patch switches all allocators over to mmap, so you can see a lot
> of "stitches" in the graphs, where an allocator is created and thrown
> away quickly. Those allocations are probably better served by malloc.
>
> > It's great that you provide measurements, but it's not clear what you are
> > measuring. Does 'mem max' include the overhead of asking the kernel for
> > tiny 4K allocations, if any? Also, what is your operating system and
> > architecture? That could make a big difference.
>
> The memory size measurements in this data were all taken from
> /proc/smaps on Linux, counting the number of dirty pages.
>
> > Have you looked at the effect of twiddling the default 4K slab size in
> > BumpPtrAllocator? I suspect you could get more dramatic results that way.
>
> If we did that, one thing that might happen is that malloc might start
> forwarding to mmap, but I think you have to allocate ~128K at a time
> to hit that threshold.
>
> Reid
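For reference, the 4K slab size Jakob mentions is a constructor argument on
BumpPtrAllocator; something along these lines should exercise it (argument
names and the exact signature are from memory and should be checked against
llvm/Support/Allocator.h in your tree):

    #include "llvm/Support/Allocator.h"

    // Illustrative: use 64 KiB slabs instead of the default 4 KiB so the
    // allocator makes far fewer trips to the underlying malloc/mmap.
    llvm::BumpPtrAllocator Alloc(/*SlabSize=*/64 * 1024,
                                 /*SizeThreshold=*/4096);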
On Mon, Aug 9, 2010 at 1:42 PM, Ken Ferry <kenferry at gmail.com> wrote:
> On Sun, Aug 8, 2010 at 9:20 PM, Reid Kleckner <reid.kleckner at gmail.com>
> wrote:
>>
>> On Sun, Aug 8, 2010 at 8:20 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk>
>> wrote:
>> >
>> > On Aug 7, 2010, at 7:05 PM, Steven Noonan wrote:
>> >> I've been doing work on memory reduction in Unladen Swallow, and
>> >> during testing, LiveRanges seemed to be consuming one of the largest
>> >> chunks of memory.
>> >
>> > That's interesting. How did you measure this? I'd love to see your data.
>> >
>> > Note that the LiveRange struct is allocated by a plain std::vector, and
>> > your patch doesn't change that. I assume you are talking about the VNInfo
>> > structs?
>>
>> Steven has been using Instruments, and sending us screenshots. Does
>> anyone else know a better way of exporting that data?
>
> So, just so you're aware, direct calls to mmap are not intercepted and
> reported by Instruments. Using mmap instead of malloc will make your
> _reported_ numbers go down, but that doesn't necessarily mean you have
> better performance.

I am aware. We used Instruments mostly to drill down and find the
places that were doing the allocation. The graphs generated by perf.py
use dirty pages.

> This is a problem for people doing performance measurements on Mac OS X
> and iOS, because exotic memory allocation schemes seem to be becoming
> more common (I hope not because they dodge reporting!). In particular,
> many image buffers are allocated directly from mmap and vm_allocate,
> within CoreGraphics and elsewhere.

Yeah, it is kind of annoying that by doing this, we make it harder to
use Instruments to find problems. :-/

Reid
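For anyone wanting to reproduce the dirty-page numbers, the measurement boils
down to summing dirty-page fields from /proc/<pid>/smaps; a rough standalone
sketch of the idea follows (summing only Private_Dirty is an assumption here;
the exact fields perf.py counts may differ):

    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
      std::ifstream smaps("/proc/self/smaps");
      std::string line;
      long dirtyKB = 0;
      while (std::getline(smaps, line)) {
        if (line.compare(0, 14, "Private_Dirty:") == 0) {
          std::istringstream fields(line.substr(14));
          long kb = 0;
          fields >> kb;       // value is reported in kB
          dirtyKB += kb;
        }
      }
      std::cout << "private dirty: " << dirtyKB << " kB\n";
      return 0;
    }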
On 9 Aug 2010, at 22:42, Ken Ferry wrote:
>
> On Sun, Aug 8, 2010 at 9:20 PM, Reid Kleckner <reid.kleckner at gmail.com> wrote:
> > On Sun, Aug 8, 2010 at 8:20 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
> > >
> > > On Aug 7, 2010, at 7:05 PM, Steven Noonan wrote:
> > >> I've been doing work on memory reduction in Unladen Swallow, and
> > >> during testing, LiveRanges seemed to be consuming one of the largest
> > >> chunks of memory.
> > >
> > > That's interesting. How did you measure this? I'd love to see your data.
> > >
> > > Note that the LiveRange struct is allocated by a plain std::vector, and
> > > your patch doesn't change that. I assume you are talking about the VNInfo
> > > structs?
> >
> > Steven has been using Instruments, and sending us screenshots. Does
> > anyone else know a better way of exporting that data?
>
> So, just so you're aware, direct calls to mmap are not intercepted and
> reported by Instruments. Using mmap instead of malloc will make your
> _reported_ numbers go down, but that doesn't necessarily mean you have
> better performance.
>
> This is a problem for people doing performance measurements on Mac OS X
> and iOS, because exotic memory allocation schemes seem to be becoming
> more common (I hope not because they dodge reporting!). In particular,
> many image buffers are allocated directly from mmap and vm_allocate,
> within CoreGraphics and elsewhere.

If this is for Mac OS X, you can use malloc zones instead. They also
provide a way to deallocate all memory at once, and they probably work
with Instruments too.

-- Jean-Daniel
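For completeness, a minimal sketch of the Mac OS X zone API Jean-Daniel
refers to (usage illustrative):

    #include <malloc/malloc.h>  // Mac OS X zone allocator API

    int main() {
      // Create a private zone, allocate from it, then tear down the whole
      // zone at once; everything allocated from it is released together.
      malloc_zone_t *zone = malloc_create_zone(/*start_size=*/0, /*flags=*/0);
      void *p = malloc_zone_malloc(zone, 4096);
      (void)p;                   // ... use p here ...
      malloc_destroy_zone(zone); // releases p and every other allocation
      return 0;
    }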