thr3ads.net - llvm dev - [llvm-dev] IntrusiveRefCntPtr vs std::shared

If this information is useful, please help other people find it:
Share via:

Bruce Hoult via llvm-dev

2016-Oct-19 18:14 UTC

[llvm-dev] IntrusiveRefCntPtr vs std::shared_ptr

On Wed, Oct 19, 2016 at 6:24 PM, Benjamin Kramer via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> In terms of performance shared_ptr has a number of disadvantages. One
> is that it always uses atomics even though most IntrusiveRefCntPtrs
> are used in single-threaded contexts. Another is weak_ptr adding a lot
> of complexity to the implementation, IntrusiveRefCntPtr doesn't
> support weak references.
>
> With that it's hard to make a case for changing uses of
> IntrusiveRefCntPtr as it's a non-trivial amount of work
> (IntrusiveRefCntPtr binds the reference count to the object itself,
> shared_ptr doesn't. Figuring out when a value held by an
> IntrusiveRefCntPtr is passed around by raw pointer and stuffed into
> another IntrusiveRefCntPtr is hard) with potential negative
> performance impact.
>
In terms of performance, the whole concept has a number of disavantages :-)

I recently tried an experiment. I compiled a 40000 line C file
(concatenated all the files of a project together) to .bc with clang, and
then ran llc on it. I tried it on both Ubuntu 16.04 x64 and on an Odroid
XU-4 ARM board. with very similar results.

I made a tiny library with a 1 GB static char array. I made a malloc() that
simply bumped a pointer (prepending a 32 bit object size, just for
realloc(), grrrrrr kill it with fire), and a free() that is an empty
function. There's a calloc() that calls the above malloc() and then
memset(). And a realloc() that is a no-op if the size is smaller, or does
malloc(), memcpy() if bigger.

Then I used LD_PRELOAD to replace the standard malloc library with mine.

Result: ~10% faster execution than llc without LD_PRELOAD, and ~180 MB of
the array used (120 MB on the 32 bit ARM).

Then I built BDW GC as a malloc replacement (with free() as a no-op) and
used LD_PRELOAD with it.

Result: ~20% faster execution than llc without LD_PRELOAD, and ~10 MB of
RAM used.

In this experiment all the reference counting in IntrusiveRefCntPtr
or shared_ptr or whatever still takes place, the same as before. But at the
end, when it decides to call free, it's a no-op. So all the
reference-counting machinery is a complete waste of time and code and RAM
and the program would run strictly faster if it was ripped out.

I don't know for sure (it's a lot more work to try!), but I would not be
surprised to see a further 10%-20% speedup.

And then you come to the cognitive load on the programmer, trying to decide
whether to use IntrusiveRefCntPtr or shared_ptr or unique_ptr or auto_ptr
or weak_ptr or whether and where to call free()/delete. And the extra
typing needed to write it instead of using a raw pointer. And the extra
time and cognitive load to read the code. And for what?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161019/a6bb426f/attachment.html>

Mehdi Amini via llvm-dev

2016-Oct-19 18:31 UTC

head link

[llvm-dev] IntrusiveRefCntPtr vs std::shared_ptr

> On Oct 19, 2016, at 11:14 AM, Bruce Hoult via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> On Wed, Oct 19, 2016 at 6:24 PM, Benjamin Kramer via llvm-dev <llvm-dev
at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
> In terms of performance shared_ptr has a number of disadvantages. One
> is that it always uses atomics even though most IntrusiveRefCntPtrs
> are used in single-threaded contexts. Another is weak_ptr adding a lot
> of complexity to the implementation, IntrusiveRefCntPtr doesn't
> support weak references.
> 
> With that it's hard to make a case for changing uses of
> IntrusiveRefCntPtr as it's a non-trivial amount of work
> (IntrusiveRefCntPtr binds the reference count to the object itself,
> shared_ptr doesn't. Figuring out when a value held by an
> IntrusiveRefCntPtr is passed around by raw pointer and stuffed into
> another IntrusiveRefCntPtr is hard) with potential negative
> performance impact.
>  
> In terms of performance, the whole concept has a number of disavantages :-)
> 
> I recently tried an experiment. I compiled a 40000 line C file
(concatenated all the files of a project together) to .bc with clang, and then
ran llc on it. I tried it on both Ubuntu 16.04 x64 and on an Odroid XU-4 ARM
board. with very similar results.
> 
> I made a tiny library with a 1 GB static char array. I made a malloc() that
simply bumped a pointer (prepending a 32 bit object size, just for realloc(),
grrrrrr kill it with fire), and a free() that is an empty function. There's
a calloc() that calls the above malloc() and then memset(). And a realloc() that
is a no-op if the size is smaller, or does malloc(), memcpy() if bigger.
> 
> Then I used LD_PRELOAD to replace the standard malloc library with mine.
> 
> Result: ~10% faster execution than llc without LD_PRELOAD, and ~180 MB of
the array used (120 MB on the 32 bit ARM).
> 
> Then I built BDW GC as a malloc replacement (with free() as a no-op) and
used LD_PRELOAD with it.
> 
> Result: ~20% faster execution than llc without LD_PRELOAD, and ~10 MB of
RAM used.
> 
> In this experiment all the reference counting in IntrusiveRefCntPtr or
shared_ptr or whatever still takes place, the same as before. But at the end,
when it decides to call free, it's a no-op. So all the reference-counting
machinery is a complete waste of time and code and RAM and the program would run
strictly faster if it was ripped out.
I may miss something in your description, but it seems like you’re never
releasing memory? I’m not sure I follow how is it a good thing?
Also what about destructor?

— 
Mehdi


> 
> I don't know for sure (it's a lot more work to try!), but I would
not be surprised to see a further 10%-20% speedup.
> 
> 
> And then you come to the cognitive load on the programmer, trying to decide
whether to use IntrusiveRefCntPtr or shared_ptr or unique_ptr or auto_ptr or
weak_ptr or whether and where to call free()/delete. And the extra typing needed
to write it instead of using a raw pointer. And the extra time and cognitive
load to read the code. And for what?
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161019/de8ff93a/attachment.html>

David Jones via llvm-dev

2016-Oct-19 19:06 UTC

head link

[llvm-dev] IntrusiveRefCntPtr vs std::shared_ptr

Out of curiosity, when you did these tests, did you compile your test
program with -O2?

I am working on a large codebase that uses intrusive reference counting.
I'm finding that without -O (e.g. build for easy debug) it's quite slow,
but with -O2 it's maybe 4x faster. This indicates that optimization can
clean up a lot of the overhead of reference counting.

Whether or not a GC such as BDWGC can help you depends on how much "memory
churn" your program does. Again, I've found that my software tends to
allocate and free quite a lot of memory, so if I used GC, then I'd spend
25%+ time doing the GC. On the other hand, if you do more computing than
allocating, then you may come out ahead - GC has zero overhead when not
actually allocating, whereas reference-counting can have overhead when
traversing a data structure even if you malloc/free nothing.

As always, the particular application is what matters, and your mileage may
vary.


On Wed, Oct 19, 2016 at 2:14 PM, Bruce Hoult via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> On Wed, Oct 19, 2016 at 6:24 PM, Benjamin Kramer via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> In terms of performance shared_ptr has a number of disadvantages. One
>> is that it always uses atomics even though most IntrusiveRefCntPtrs
>> are used in single-threaded contexts. Another is weak_ptr adding a lot
>> of complexity to the implementation, IntrusiveRefCntPtr doesn't
>> support weak references.
>>
>> With that it's hard to make a case for changing uses of
>> IntrusiveRefCntPtr as it's a non-trivial amount of work
>> (IntrusiveRefCntPtr binds the reference count to the object itself,
>> shared_ptr doesn't. Figuring out when a value held by an
>> IntrusiveRefCntPtr is passed around by raw pointer and stuffed into
>> another IntrusiveRefCntPtr is hard) with potential negative
>> performance impact.
>>
>
> In terms of performance, the whole concept has a number of disavantages :-)
>
> I recently tried an experiment. I compiled a 40000 line C file
> (concatenated all the files of a project together) to .bc with clang, and
> then ran llc on it. I tried it on both Ubuntu 16.04 x64 and on an Odroid
> XU-4 ARM board. with very similar results.
>
> I made a tiny library with a 1 GB static char array. I made a malloc()
> that simply bumped a pointer (prepending a 32 bit object size, just for
> realloc(), grrrrrr kill it with fire), and a free() that is an empty
> function. There's a calloc() that calls the above malloc() and then
> memset(). And a realloc() that is a no-op if the size is smaller, or does
> malloc(), memcpy() if bigger.
>
> Then I used LD_PRELOAD to replace the standard malloc library with mine.
>
> Result: ~10% faster execution than llc without LD_PRELOAD, and ~180 MB of
> the array used (120 MB on the 32 bit ARM).
>
> Then I built BDW GC as a malloc replacement (with free() as a no-op) and
> used LD_PRELOAD with it.
>
> Result: ~20% faster execution than llc without LD_PRELOAD, and ~10 MB of
> RAM used.
>
> In this experiment all the reference counting in IntrusiveRefCntPtr
> or shared_ptr or whatever still takes place, the same as before. But at the
> end, when it decides to call free, it's a no-op. So all the
> reference-counting machinery is a complete waste of time and code and RAM
> and the program would run strictly faster if it was ripped out.
>
> I don't know for sure (it's a lot more work to try!), but I would
not be
> surprised to see a further 10%-20% speedup.
>
>
> And then you come to the cognitive load on the programmer, trying to
> decide whether to use IntrusiveRefCntPtr or shared_ptr or unique_ptr or
> auto_ptr or weak_ptr or whether and where to call free()/delete. And the
> extra typing needed to write it instead of using a raw pointer. And the
> extra time and cognitive load to read the code. And for what?
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161019/518e1e51/attachment.html>

Bruce Hoult via llvm-dev

2016-Oct-19 19:16 UTC

head link

[llvm-dev] IntrusiveRefCntPtr vs std::shared_ptr

On Wed, Oct 19, 2016 at 9:31 PM, Mehdi Amini via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
> On Oct 19, 2016, at 11:14 AM, Bruce Hoult via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> On Wed, Oct 19, 2016 at 6:24 PM, Benjamin Kramer via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> In terms of performance shared_ptr has a number of disadvantages. One
>> is that it always uses atomics even though most IntrusiveRefCntPtrs
>> are used in single-threaded contexts. Another is weak_ptr adding a lot
>> of complexity to the implementation, IntrusiveRefCntPtr doesn't
>> support weak references.
>>
>> With that it's hard to make a case for changing uses of
>> IntrusiveRefCntPtr as it's a non-trivial amount of work
>> (IntrusiveRefCntPtr binds the reference count to the object itself,
>> shared_ptr doesn't. Figuring out when a value held by an
>> IntrusiveRefCntPtr is passed around by raw pointer and stuffed into
>> another IntrusiveRefCntPtr is hard) with potential negative
>> performance impact.
>>
>
> In terms of performance, the whole concept has a number of disavantages :-)
>
> I recently tried an experiment. I compiled a 40000 line C file
> (concatenated all the files of a project together) to .bc with clang, and
> then ran llc on it. I tried it on both Ubuntu 16.04 x64 and on an Odroid
> XU-4 ARM board. with very similar results.
>
> I made a tiny library with a 1 GB static char array. I made a malloc()
> that simply bumped a pointer (prepending a 32 bit object size, just for
> realloc(), grrrrrr kill it with fire), and a free() that is an empty
> function. There's a calloc() that calls the above malloc() and then
> memset(). And a realloc() that is a no-op if the size is smaller, or does
> malloc(), memcpy() if bigger.
>
> Then I used LD_PRELOAD to replace the standard malloc library with mine.
>
> Result: ~10% faster execution than llc without LD_PRELOAD, and ~180 MB of
> the array used (120 MB on the 32 bit ARM).
>
> Then I built BDW GC as a malloc replacement (with free() as a no-op) and
> used LD_PRELOAD with it.
>
> Result: ~20% faster execution than llc without LD_PRELOAD, and ~10 MB of
> RAM used.
>
> In this experiment all the reference counting in IntrusiveRefCntPtr
> or shared_ptr or whatever still takes place, the same as before. But at the
> end, when it decides to call free, it's a no-op. So all the
> reference-counting machinery is a complete waste of time and code and RAM
> and the program would run strictly faster if it was ripped out.
>
>
> I may miss something in your description, but it seems like you’re never
> releasing memory? I’m not sure I follow how is it a good thing?
>
I did two different tests.

In the first test I never released memory. The compiler allocated 120 - 180
MB of total memory compiling a 40000 line C file. Typical C files are much
smaller that this, so it's potentially a valid strategy if you make a new
invocation of the compile for every C file. However, it was mostly just for
statistics-gathering purposes.

In the second test I used a GC. I never released memory, but it was
collected when objects were no longer reachable.


> Also what about destructor?
>
Stack-based objects would still have destructors called, heap based objects
will not. As 99% of destructors only deal with releasing other memory owned
by the object anyway, this is not important.

Some destructors may be closing files or something like that. I didn't
notice problems. The compiler ran fine in both cases, and produced asm
output identical to running it normally.

This is just an experiment. Obviously, if someone were to decide to replace
explicit memory management with GC in the llvm project then some real work
would be required to audit the code and find any issues.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161019/b532c876/attachment.html>

Bruce Hoult via llvm-dev

2016-Oct-19 19:20 UTC

head link

[llvm-dev] IntrusiveRefCntPtr vs std::shared_ptr

On Wed, Oct 19, 2016 at 10:06 PM, David Jones <djones at xtreme-eda.com>
wrote:
> Out of curiosity, when you did these tests, did you compile your test
> program with -O2?
>
The test program, or llvm?

llvm was the standard llvm installed by Ubuntu's package manager. I expect
it's compiled with some nonzero -O level

I think I compiled my test program without optimization. That will change
the amount of work llvm does, but I think not the nature of it.

I'm happy to do more tests.

>
>I am working on a large codebase that uses intrusive reference
counting.> I'm finding that without -O (e.g. build for easy debug) it's quite
slow,
> but with -O2 it's maybe 4x faster. This indicates that optimization can
> clean up a lot of the overhead of reference counting.
>
> Whether or not a GC such as BDWGC can help you depends on how much
"memory
> churn" your program does. Again, I've found that my software tends
to
> allocate and free quite a lot of memory, so if I used GC, then I'd
spend
> 25%+ time doing the GC. On the other hand, if you do more computing than
> allocating, then you may come out ahead - GC has zero overhead when not
> actually allocating, whereas reference-counting can have overhead when
> traversing a data structure even if you malloc/free nothing.
>
> As always, the particular application is what matters, and your mileage
> may vary.
>
>
> On Wed, Oct 19, 2016 at 2:14 PM, Bruce Hoult via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> On Wed, Oct 19, 2016 at 6:24 PM, Benjamin Kramer via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> In terms of performance shared_ptr has a number of disadvantages.
One
>>> is that it always uses atomics even though most IntrusiveRefCntPtrs
>>> are used in single-threaded contexts. Another is weak_ptr adding a
lot
>>> of complexity to the implementation, IntrusiveRefCntPtr doesn't
>>> support weak references.
>>>
>>> With that it's hard to make a case for changing uses of
>>> IntrusiveRefCntPtr as it's a non-trivial amount of work
>>> (IntrusiveRefCntPtr binds the reference count to the object itself,
>>> shared_ptr doesn't. Figuring out when a value held by an
>>> IntrusiveRefCntPtr is passed around by raw pointer and stuffed into
>>> another IntrusiveRefCntPtr is hard) with potential negative
>>> performance impact.
>>>
>>
>> In terms of performance, the whole concept has a number of disavantages
>> :-)
>>
>> I recently tried an experiment. I compiled a 40000 line C file
>> (concatenated all the files of a project together) to .bc with clang,
and
>> then ran llc on it. I tried it on both Ubuntu 16.04 x64 and on an
Odroid
>> XU-4 ARM board. with very similar results.
>>
>> I made a tiny library with a 1 GB static char array. I made a malloc()
>> that simply bumped a pointer (prepending a 32 bit object size, just for
>> realloc(), grrrrrr kill it with fire), and a free() that is an empty
>> function. There's a calloc() that calls the above malloc() and then
>> memset(). And a realloc() that is a no-op if the size is smaller, or
does
>> malloc(), memcpy() if bigger.
>>
>> Then I used LD_PRELOAD to replace the standard malloc library with
mine.
>>
>> Result: ~10% faster execution than llc without LD_PRELOAD, and ~180 MB
of
>> the array used (120 MB on the 32 bit ARM).
>>
>> Then I built BDW GC as a malloc replacement (with free() as a no-op)
and
>> used LD_PRELOAD with it.
>>
>> Result: ~20% faster execution than llc without LD_PRELOAD, and ~10 MB
of
>> RAM used.
>>
>> In this experiment all the reference counting in IntrusiveRefCntPtr
>> or shared_ptr or whatever still takes place, the same as before. But at
the
>> end, when it decides to call free, it's a no-op. So all the
>> reference-counting machinery is a complete waste of time and code and
RAM
>> and the program would run strictly faster if it was ripped out.
>>
>> I don't know for sure (it's a lot more work to try!), but I
would not be
>> surprised to see a further 10%-20% speedup.
>>
>>
>> And then you come to the cognitive load on the programmer, trying to
>> decide whether to use IntrusiveRefCntPtr or shared_ptr or unique_ptr or
>> auto_ptr or weak_ptr or whether and where to call free()/delete. And
the
>> extra typing needed to write it instead of using a raw pointer. And the
>> extra time and cognitive load to read the code. And for what?
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>
> --
> This message has been scanned for viruses and
> dangerous content by *MailScanner* <http://www.mailscanner.info/>,
and is
> believed to be clean.-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161019/92e09398/attachment.html>

Seemingly Similar Threads

Search for more reasonably related threads

llvm dev - Oct 2016 - IntrusiveRefCntPtr vs std::shared_ptr

[llvm-dev] IntrusiveRefCntPtr vs std::shared_ptr

[llvm-dev] IntrusiveRefCntPtr vs std::shared_ptr

[llvm-dev] IntrusiveRefCntPtr vs std::shared_ptr

[llvm-dev] IntrusiveRefCntPtr vs std::shared_ptr

[llvm-dev] IntrusiveRefCntPtr vs std::shared_ptr

Seemingly Similar Threads