On Sun, Aug 8, 2010 at 8:20 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
>
> On Aug 7, 2010, at 7:05 PM, Steven Noonan wrote:
>> I've been doing work on memory reduction in Unladen Swallow, and
>> during testing, LiveRanges seemed to be consuming one of the largest
>> chunks of memory.
>
> That's interesting. How did you measure this? I'd love to see your data.
>
> Note that the LiveRange struct is allocated by a plain std::vector, and
> your patch doesn't change that. I assume you are talking about the VNInfo
> structs?

Steven has been using Instruments, and sending us screenshots. Does
anyone else know a better way of exporting that data?

I thought I dug into the register allocation code and found the
VNInfo::Allocator typedef. I assumed that was getting the traffic we
saw in Instruments, but I don't have the data to back that up.

>> I wrote a replacement allocator for use by
>> BumpPtrAllocator which uses mmap()/munmap() in place of
>> malloc()/free().
>
> It's a bit more complicated than that. Modern mallocs use a whole bag of
> tricks to avoid lock contention on multiprocessor systems, and they know
> which allocation sizes the kernel likes, and which system calls to use.
>
> By calling mmap directly, you are throwing all that system-specific
> knowledge away.

So the goal of this particular modification was to find ways to return
large, one-time allocations that happen during compilation back to the
OS. For unladen-swallow, we have a long-lived Python process where we
JIT code every so often. We happen to generate an ungodly amount of
code, which we're trying to reduce. However, this means that LLVM
allocates a lot of memory for us, and it grows our heap by several MB
over what it would normally be. The breakdown was roughly 8 MB
allocated for this one compilation in the spam_bayes benchmark, with
2 MB coming from register allocation and 2 MB from SDNodes.

We are looking at using mmap/munmap to avoid permanently growing the heap.

This patch switches all allocators over to mmap, so you can see a lot
of "stitches" in the graphs, where an allocator is created and thrown
away quickly. Those allocations are probably better served by malloc.

> It's great that you provide measurements, but it's not clear what you are
> measuring. Does 'mem max' include the overhead of asking the kernel for
> tiny 4K allocations, if any? Also, what is your operating system and
> architecture? That could make a big difference.

The memory size measurements in this data were all taken from
/proc/smaps on Linux, counting the number of dirty pages.

> Have you looked at the effect of twiddling the default 4K slab size in
> BumpPtrAllocator? I suspect you could get more dramatic results that way.

If we did that, one thing that might happen is that malloc might start
forwarding to mmap, but I think you have to allocate ~128K at a time
to hit that threshold.

Reid
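For readers who want to see what a direct-mmap slab allocator amounts to, here
is a minimal sketch; the class name and interface are illustrative, not the
actual Unladen Swallow patch or LLVM's slab allocator interface:

    #include <sys/mman.h>
    #include <cstddef>

    // Illustrative only: hand whole slabs to the kernel with mmap() and
    // return them with munmap(), so freed pages leave the process
    // immediately instead of lingering on the malloc heap.
    // (MAP_ANONYMOUS is spelled MAP_ANON on some systems.)
    class MmapSlabAllocator {
    public:
      void *Allocate(std::size_t Size) {
        void *P = ::mmap(0, Size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (P == MAP_FAILED)
          return 0;
        return P;
      }
      void Deallocate(void *Ptr, std::size_t Size) {
        if (Ptr)
          ::munmap(Ptr, Size);  // pages go straight back to the OS
      }
    };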
On Aug 8, 2010, at 9:20 PM, Reid Kleckner wrote:

> I thought I dug into the register allocation code and found the
> VNInfo::Allocator typedef. I assumed that was getting the traffic we
> saw in Instruments, but I don't have the data to back that up.

Are you using llvm from trunk? VNInfo is a lot smaller now than it was
in 2.7. I would guess about a third of the liveness memory usage goes
through the VNInfo BumpPtrAllocator.

[...]

>> By calling mmap directly, you are throwing all that system-specific
>> knowledge away.
>
> So the goal of this particular modification was to find ways to return
> large, one-time allocations that happen during compilation back to the
> OS. For unladen-swallow, we have a long-lived Python process where we
> JIT code every so often. We happen to generate an ungodly amount of
> code, which we're trying to reduce. However, this means that LLVM
> allocates a lot of memory for us, and it grows our heap by several MB
> over what it would normally be. The breakdown was roughly 8 MB
> allocated for this one compilation in the spam_bayes benchmark, with
> 2 MB coming from register allocation and 2 MB from SDNodes.
>
> We are looking at using mmap/munmap to avoid permanently growing the heap.

Don't try to outsmart malloc, you are going to lose ;-)

This all depends on specific kernel implementation details, but you
risk badly fragmenting your address space, and chances are the kernel
is not going to handle that well. You are using the kernel as a memory
manager, but the kernel wants to be used as a dumb slab allocator for
malloc.

I assume that LLVM is properly freeing memory after jitting?
Otherwise, that should be looked at.

So why isn't your malloc returning the memory to the OS?

Is it because malloc thinks you might be needing that memory soon
anyway? Is it correct?

Does your malloc know that you are running with very little memory,
and that the system badly needs those 8MB? Maybe your malloc needs to
be tuned for a small device?

Is LLVM leaving a fragmented heap behind? Why? That would be worth
looking into.

/jakob
On Mon, 9 Aug 2010 09:36:53 -0700, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
>
> On Aug 8, 2010, at 9:20 PM, Reid Kleckner wrote:
>
> > I thought I dug into the register allocation code and found the
> > VNInfo::Allocator typedef. I assumed that was getting the traffic
> > we saw in Instruments, but I don't have the data to back that up.
>
> Are you using llvm from trunk? VNInfo is a lot smaller now than it
> was in 2.7. I would guess about a third of the liveness memory usage
> goes through the VNInfo BumpPtrAllocator.
>
> [...]
>
> >> By calling mmap directly, you are throwing all that system-specific
> >> knowledge away.
> >
> > So the goal of this particular modification was to find ways to
> > return large, one-time allocations that happen during compilation
> > back to the OS. For unladen-swallow, we have a long-lived Python
> > process where we JIT code every so often. We happen to generate an
> > ungodly amount of code, which we're trying to reduce. However,
> > this means that LLVM allocates a lot of memory for us, and it grows
> > our heap by several MB over what it would normally be. The
> > breakdown was roughly 8 MB allocated for this one compilation
> > in the spam_bayes benchmark, with 2 MB coming from register
> > allocation and 2 MB from SDNodes.
> >
> > We are looking at using mmap/munmap to avoid permanently growing
> > the heap.
>
> Don't try to outsmart malloc, you are going to lose ;-)
>
> This all depends on specific kernel implementation details, but you
> risk badly fragmenting your address space, and chances are the kernel
> is not going to handle that well. You are using the kernel as a
> memory manager, but the kernel wants to be used as a dumb slab
> allocator for malloc.
>
> I assume that LLVM is properly freeing memory after jitting?
> Otherwise, that should be looked at.
>
> So why isn't your malloc returning the memory to the OS?
>
> Is it because malloc thinks you might be needing that memory soon
> anyway? Is it correct?
>
> Does your malloc know that you are running with very little memory,
> and that the system badly needs those 8MB? Maybe your malloc needs
> to be tuned for a small device?
>
> Is LLVM leaving a fragmented heap behind?

With mmap() it is always possible to fully release the memory once you
are done using it. With malloc(), no: it takes just one allocation at
the end of the heap to keep all the rest allocated. That wouldn't be a
problem if libc used mmap() as the low-level allocator for malloc, but
it doesn't. It mostly uses sbrk() for small (<128k) allocations, and
even with mmaps it caches them for a while. I think that is because
mmap() is slow in multithreaded apps, since it needs to take a
process-level lock, which also contends with the lock taken by page
faults from other existing mmaps (in fact that lock is held during
disk I/O!).

Best regards,
--Edwin
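As a glibc-specific aside, the mmap threshold Edwin mentions is tunable, and
the heap can also be trimmed explicitly; a minimal sketch (the 64 KiB value
is purely illustrative):

    #include <malloc.h>  // glibc-specific: mallopt(), malloc_trim()

    int main() {
      // Ask glibc to service allocations above 64 KiB with mmap() so that
      // free() can unmap them instead of leaving them on the sbrk() heap.
      mallopt(M_MMAP_THRESHOLD, 64 * 1024);

      // ... JIT / compile here ...

      // Return any free pages at the top of the heap to the kernel.
      malloc_trim(0);
      return 0;
    }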
On Sun, Aug 8, 2010 at 9:20 PM, Reid Kleckner <reid.kleckner at gmail.com> wrote:

> On Sun, Aug 8, 2010 at 8:20 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk>
> wrote:
> >
> > On Aug 7, 2010, at 7:05 PM, Steven Noonan wrote:
> >> I've been doing work on memory reduction in Unladen Swallow, and
> >> during testing, LiveRanges seemed to be consuming one of the largest
> >> chunks of memory.
> >
> > That's interesting. How did you measure this? I'd love to see your data.
> >
> > Note that the LiveRange struct is allocated by a plain std::vector, and
> > your patch doesn't change that. I assume you are talking about the VNInfo
> > structs?
>
> Steven has been using Instruments, and sending us screenshots. Does
> anyone else know a better way of exporting that data?

So, just so you're aware, direct calls to mmap are not intercepted and
reported by Instruments. Using mmap instead of malloc will make your
_reported_ numbers go down, but that doesn't necessarily mean you have
better performance.

This is a problem for people doing performance measurements on Mac OS X
and iOS, because exotic memory allocation schemes seem to be becoming
more common (I hope not because they dodge reporting!). In particular,
many image buffers are allocated directly from mmap and vm_allocate,
within CoreGraphics and elsewhere.

-Ken
Cocoa Frameworks

> I thought I dug into the register allocation code and found the
> VNInfo::Allocator typedef. I assumed that was getting the traffic we
> saw in Instruments, but I don't have the data to back that up.
>
> >> I wrote a replacement allocator for use by
> >> BumpPtrAllocator which uses mmap()/munmap() in place of
> >> malloc()/free().
> >
> > It's a bit more complicated than that. Modern mallocs use a whole bag of
> > tricks to avoid lock contention on multiprocessor systems, and they know
> > which allocation sizes the kernel likes, and which system calls to use.
> >
> > By calling mmap directly, you are throwing all that system-specific
> > knowledge away.
>
> So the goal of this particular modification was to find ways to return
> large, one-time allocations that happen during compilation back to the
> OS. For unladen-swallow, we have a long-lived Python process where we
> JIT code every so often. We happen to generate an ungodly amount of
> code, which we're trying to reduce. However, this means that LLVM
> allocates a lot of memory for us, and it grows our heap by several MB
> over what it would normally be. The breakdown was roughly 8 MB
> allocated for this one compilation in the spam_bayes benchmark, with
> 2 MB coming from register allocation and 2 MB from SDNodes.
>
> We are looking at using mmap/munmap to avoid permanently growing the heap.
>
> This patch switches all allocators over to mmap, so you can see a lot
> of "stitches" in the graphs, where an allocator is created and thrown
> away quickly. Those allocations are probably better served by malloc.
>
> > It's great that you provide measurements, but it's not clear what you are
> > measuring. Does 'mem max' include the overhead of asking the kernel for
> > tiny 4K allocations, if any? Also, what is your operating system and
> > architecture? That could make a big difference.
>
> The memory size measurements in this data were all taken from
> /proc/smaps on Linux, counting the number of dirty pages.
>
> > Have you looked at the effect of twiddling the default 4K slab size in
> > BumpPtrAllocator? I suspect you could get more dramatic results that way.
>
> If we did that, one thing that might happen is that malloc might start
> forwarding to mmap, but I think you have to allocate ~128K at a time
> to hit that threshold.
>
> Reid
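For reference, the 4K slab size Jakob mentions is a constructor argument on
BumpPtrAllocator; something along these lines should exercise it (argument
names and the exact signature are from memory and should be checked against
llvm/Support/Allocator.h in your tree):

    #include "llvm/Support/Allocator.h"

    // Illustrative: use 64 KiB slabs instead of the default 4 KiB so the
    // allocator makes far fewer trips to the underlying malloc/mmap.
    llvm::BumpPtrAllocator Alloc(/*SlabSize=*/64 * 1024,
                                 /*SizeThreshold=*/4096);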
On Mon, Aug 9, 2010 at 1:42 PM, Ken Ferry <kenferry at gmail.com> wrote:
> On Sun, Aug 8, 2010 at 9:20 PM, Reid Kleckner <reid.kleckner at gmail.com>
> wrote:
>>
>> On Sun, Aug 8, 2010 at 8:20 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk>
>> wrote:
>> >
>> > On Aug 7, 2010, at 7:05 PM, Steven Noonan wrote:
>> >> I've been doing work on memory reduction in Unladen Swallow, and
>> >> during testing, LiveRanges seemed to be consuming one of the largest
>> >> chunks of memory.
>> >
>> > That's interesting. How did you measure this? I'd love to see your data.
>> >
>> > Note that the LiveRange struct is allocated by a plain std::vector, and
>> > your patch doesn't change that. I assume you are talking about the VNInfo
>> > structs?
>>
>> Steven has been using Instruments, and sending us screenshots. Does
>> anyone else know a better way of exporting that data?
>
> So, just so you're aware, direct calls to mmap are not intercepted and
> reported by Instruments. Using mmap instead of malloc will make your
> _reported_ numbers go down, but that doesn't necessarily mean you have
> better performance.

I am aware. We used Instruments mostly to drill down and find the
places that were doing the allocation. The graphs generated by perf.py
use dirty pages.

> This is a problem for people doing performance measurements on Mac OS X
> and iOS, because exotic memory allocation schemes seem to be becoming
> more common (I hope not because they dodge reporting!). In particular,
> many image buffers are allocated directly from mmap and vm_allocate,
> within CoreGraphics and elsewhere.

Yeah, it is kind of annoying that by doing this, we make it harder to
use Instruments to find problems. :-/

Reid
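For anyone wanting to reproduce the dirty-page numbers, the measurement boils
down to summing dirty-page fields from /proc/<pid>/smaps; a rough standalone
sketch of the idea follows (summing only Private_Dirty is an assumption here;
the exact fields perf.py counts may differ):

    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
      std::ifstream smaps("/proc/self/smaps");
      std::string line;
      long dirtyKB = 0;
      while (std::getline(smaps, line)) {
        if (line.compare(0, 14, "Private_Dirty:") == 0) {
          std::istringstream fields(line.substr(14));
          long kb = 0;
          fields >> kb;       // value is reported in kB
          dirtyKB += kb;
        }
      }
      std::cout << "private dirty: " << dirtyKB << " kB\n";
      return 0;
    }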
On 9 Aug 2010, at 22:42, Ken Ferry wrote:
>
> On Sun, Aug 8, 2010 at 9:20 PM, Reid Kleckner <reid.kleckner at gmail.com> wrote:
> > On Sun, Aug 8, 2010 at 8:20 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
> > >
> > > On Aug 7, 2010, at 7:05 PM, Steven Noonan wrote:
> > >> I've been doing work on memory reduction in Unladen Swallow, and
> > >> during testing, LiveRanges seemed to be consuming one of the largest
> > >> chunks of memory.
> > >
> > > That's interesting. How did you measure this? I'd love to see your data.
> > >
> > > Note that the LiveRange struct is allocated by a plain std::vector, and
> > > your patch doesn't change that. I assume you are talking about the VNInfo
> > > structs?
> >
> > Steven has been using Instruments, and sending us screenshots. Does
> > anyone else know a better way of exporting that data?
>
> So, just so you're aware, direct calls to mmap are not intercepted and
> reported by Instruments. Using mmap instead of malloc will make your
> _reported_ numbers go down, but that doesn't necessarily mean you have
> better performance.
>
> This is a problem for people doing performance measurements on Mac OS X
> and iOS, because exotic memory allocation schemes seem to be becoming
> more common (I hope not because they dodge reporting!). In particular,
> many image buffers are allocated directly from mmap and vm_allocate,
> within CoreGraphics and elsewhere.

If this is for Mac OS X, you can use malloc zones instead. They also
provide a way to deallocate all memory at once, and they probably work
with Instruments too.

-- Jean-Daniel
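For completeness, a minimal sketch of the Mac OS X zone API Jean-Daniel
refers to (usage illustrative):

    #include <malloc/malloc.h>  // Mac OS X zone allocator API

    int main() {
      // Create a private zone, allocate from it, then tear down the whole
      // zone at once; everything allocated from it is released together.
      malloc_zone_t *zone = malloc_create_zone(/*start_size=*/0, /*flags=*/0);
      void *p = malloc_zone_malloc(zone, 4096);
      (void)p;                   // ... use p here ...
      malloc_destroy_zone(zone); // releases p and every other allocation
      return 0;
    }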