Bruce Hoult via llvm-dev
2016-Oct-19 19:16 UTC
[llvm-dev] IntrusiveRefCntPtr vs std::shared_ptr
On Wed, Oct 19, 2016 at 9:31 PM, Mehdi Amini via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> On Oct 19, 2016, at 11:14 AM, Bruce Hoult via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> On Wed, Oct 19, 2016 at 6:24 PM, Benjamin Kramer via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>>> In terms of performance shared_ptr has a number of disadvantages. One
>>> is that it always uses atomics even though most IntrusiveRefCntPtrs
>>> are used in single-threaded contexts. Another is weak_ptr adding a lot
>>> of complexity to the implementation, IntrusiveRefCntPtr doesn't
>>> support weak references.
>>>
>>> With that it's hard to make a case for changing uses of
>>> IntrusiveRefCntPtr as it's a non-trivial amount of work
>>> (IntrusiveRefCntPtr binds the reference count to the object itself,
>>> shared_ptr doesn't. Figuring out when a value held by an
>>> IntrusiveRefCntPtr is passed around by raw pointer and stuffed into
>>> another IntrusiveRefCntPtr is hard) with potential negative
>>> performance impact.
>>
>> In terms of performance, the whole concept has a number of
>> disadvantages :-)
>>
>> I recently tried an experiment. I compiled a 40000 line C file
>> (concatenated all the files of a project together) to .bc with clang,
>> and then ran llc on it. I tried it on both Ubuntu 16.04 x64 and on an
>> Odroid XU-4 ARM board, with very similar results.
>>
>> I made a tiny library with a 1 GB static char array. I made a malloc()
>> that simply bumped a pointer (prepending a 32 bit object size, just
>> for realloc(), grrrrrr kill it with fire), and a free() that is an
>> empty function. There's a calloc() that calls the above malloc() and
>> then memset(). And a realloc() that is a no-op if the size is smaller,
>> or does malloc(), memcpy() if bigger.
>>
>> Then I used LD_PRELOAD to replace the standard malloc library with
>> mine.
>>
>> Result: ~10% faster execution than llc without LD_PRELOAD, and ~180 MB
>> of the array used (120 MB on the 32 bit ARM).
>>
>> Then I built BDW GC as a malloc replacement (with free() as a no-op)
>> and used LD_PRELOAD with it.
>>
>> Result: ~20% faster execution than llc without LD_PRELOAD, and ~10 MB
>> of RAM used.
>>
>> In this experiment all the reference counting in IntrusiveRefCntPtr
>> or shared_ptr or whatever still takes place, the same as before. But
>> at the end, when it decides to call free, it's a no-op. So all the
>> reference-counting machinery is a complete waste of time and code and
>> RAM and the program would run strictly faster if it was ripped out.
>
> I may miss something in your description, but it seems like you’re
> never releasing memory? I’m not sure I follow how is it a good thing?

I did two different tests.

In the first test I never released memory. The compiler allocated 120 -
180 MB of total memory compiling a 40000 line C file. Typical C files are
much smaller than this, so it's potentially a valid strategy if you make a
new invocation of the compiler for every C file. However, it was mostly
just for statistics-gathering purposes.

In the second test I used a GC. I never explicitly released memory, but it
was collected when objects were no longer reachable.

> Also what about destructor?

Stack-based objects would still have destructors called, heap-based
objects will not. As 99% of destructors only deal with releasing other
memory owned by the object anyway, this is not important.

Some destructors may be closing files or something like that. I didn't
notice problems. The compiler ran fine in both cases, and produced asm
output identical to running it normally.

This is just an experiment. Obviously, if someone were to decide to
replace explicit memory management with GC in the llvm project then some
real work would be required to audit the code and find any issues.
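For readers who want to picture the first experiment, here is a minimal
sketch of a bump-pointer allocator along the lines described in the message
above. The original source was not posted, so everything here is an
assumption made for illustration: the bump_ function names, the 1 MiB arena
(the post used a 1 GB array), and the 16-byte header slot used to keep
payloads aligned. Only the overall shape comes from the post: malloc
advances a pointer and stores a 32-bit size in front of the payload, free
does nothing, calloc is malloc plus memset, and realloc is a no-op when
shrinking and malloc plus memcpy when growing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical names and sizes; the original post does not show its source.
constexpr std::size_t kArenaSize = 1u << 20;  // 1 MiB here; the post used 1 GB
constexpr std::size_t kAlign = 16;            // assumed alignment and header slot size
static_assert(kArenaSize <= UINT32_MAX, "sizes must fit the 32-bit header");

alignas(kAlign) static unsigned char arena[kArenaSize];
static std::size_t used = 0;  // offset of the next free byte (the "bumped" pointer)

// Round n up to the next multiple of kAlign.
constexpr std::size_t round_up(std::size_t n) {
    return (n + kAlign - 1) & ~(kAlign - 1);
}

// Allocate by advancing the offset. The requested size is stored as a
// 32-bit value in the last four bytes before the payload, for realloc.
void* bump_malloc(std::size_t size) {
    assert(used % kAlign == 0);
    if (size > kArenaSize - kAlign) return nullptr;
    const std::size_t total = round_up(kAlign + size);  // header slot + payload
    if (total > kArenaSize - used) return nullptr;      // arena exhausted
    unsigned char* block = arena + used;
    used += total;
    const std::uint32_t stored = static_cast<std::uint32_t>(size);
    std::memcpy(block + kAlign - sizeof stored, &stored, sizeof stored);
    return block + kAlign;
}

// Freeing is deliberately an empty function.
void bump_free(void*) {}

// calloc is malloc followed by memset, as in the post.
void* bump_calloc(std::size_t count, std::size_t size) {
    if (size != 0 && count > SIZE_MAX / size) return nullptr;  // overflow
    void* p = bump_malloc(count * size);
    if (p != nullptr) std::memset(p, 0, count * size);
    return p;
}

// realloc keeps the block when shrinking, otherwise allocates and copies.
void* bump_realloc(void* p, std::size_t size) {
    if (p == nullptr) return bump_malloc(size);
    std::uint32_t old_size;
    std::memcpy(&old_size, static_cast<unsigned char*>(p) - sizeof old_size,
                sizeof old_size);
    if (size <= old_size) return p;
    void* q = bump_malloc(size);
    if (q != nullptr) std::memcpy(q, p, old_size);
    return q;
}
```

To interpose on a real program with LD_PRELOAD, the functions would have to
be exported from a shared library under the standard names malloc, free,
calloc and realloc. The sketch keeps the bump_ prefix so that it can be
compiled and exercised next to the normal allocator.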
Mehdi Amini via llvm-dev
2016-Oct-19 19:19 UTC
[llvm-dev] IntrusiveRefCntPtr vs std::shared_ptr
> On Oct 19, 2016, at 12:16 PM, Bruce Hoult <bruce at hoult.org> wrote:
>
> I did two different tests.
>
> In the first test I never released memory. The compiler allocated 120 -
> 180 MB of total memory compiling a 40000 line C file. Typical C files are
> much smaller that this, so it's potentially a valid strategy if you make a
> new invocation of the compile for every C file. However, it was mostly just
> for statistics-gathering purposes.
>
> In the second test I used a GC. I never released memory, but it was
> collected when objects were no longer reachable.

OK I see.
How do you explain that the GC allocation provides a 10% speedup over the
simple “bump ptr allocator” (if I understand your results correctly).

--
Mehdi
Bruce Hoult via llvm-dev
2016-Oct-19 19:21 UTC
[llvm-dev] IntrusiveRefCntPtr vs std::shared_ptr
Locality of reference (largely fitting into L3 cache), and not having to
produce a large number of demand-zero CoW VM pages from the OS.

On Wed, Oct 19, 2016 at 10:19 PM, Mehdi Amini via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> OK I see.
> How do you explain that the GC allocation provides a 10% speedup over the
> simple “bump ptr allocator” (if I understand your results correctly).
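Benjamin Kramer's point earlier in the thread, that IntrusiveRefCntPtr
binds the reference count to the object itself while shared_ptr does not,
can be illustrated with a small sketch. The IntrusivePtr and Counted types
below are hypothetical stand-ins written for this example and are not
LLVM's actual IntrusiveRefCntPtr API. The sketch assumes single-threaded
use, so the count is a plain int with no atomics.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical object carrying its own reference count (not LLVM's class).
struct Counted {
    int refs = 0;     // plain int: single-threaded use assumed, no atomics
    static int live;  // number of live Counted objects, for demonstration only
    Counted() { ++live; }
    ~Counted() { --live; }
    Counted(const Counted&) = delete;
    Counted& operator=(const Counted&) = delete;
};
int Counted::live = 0;

// Hypothetical minimal intrusive smart pointer.
template <typename T>
class IntrusivePtr {
    T* p_ = nullptr;
    void retain() { if (p_) ++p_->refs; }
    void release() { if (p_ && --p_->refs == 0) delete p_; }
public:
    IntrusivePtr() = default;
    explicit IntrusivePtr(T* p) : p_(p) { retain(); }
    IntrusivePtr(const IntrusivePtr& other) : p_(other.p_) { retain(); }
    IntrusivePtr& operator=(IntrusivePtr other) {  // copy-and-swap
        std::swap(p_, other.p_);
        return *this;
    }
    ~IntrusivePtr() { release(); }
    T* get() const { return p_; }
    T* operator->() const { return p_; }
};

// Wrapping the same raw pointer twice is safe because both owners find
// the one count stored inside the object.
int rewrap_intrusive() {
    IntrusivePtr<Counted> a(new Counted);
    Counted* raw = a.get();        // passed around as a raw pointer
    IntrusivePtr<Counted> b(raw);  // re-wrapped: same object, same count
    return raw->refs;              // read before a and b are destroyed
}

// With shared_ptr the count lives in a separate control block, so sharing
// has to go through an existing shared_ptr.
long share_with_shared_ptr() {
    std::shared_ptr<int> a = std::make_shared<int>(1);
    std::shared_ptr<int> b = a;  // copy shares the control block
    // std::shared_ptr<int> c(a.get()) would create a second control block
    // and delete the int twice (undefined behavior).
    return a.use_count();
}
```

This is why a mechanical conversion is hard: every place where an object
held by an intrusive pointer travels as a raw pointer and is wrapped again
would need to be found and changed to pass the smart pointer instead.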