Rui Ueyama via llvm-dev
2016-Mar-31 01:17 UTC
[llvm-dev] LLD: Possible optimization for TargetInfo
On Wed, Mar 30, 2016 at 5:34 PM, Sean Silva <chisophugis at gmail.com> wrote:

> On Wed, Mar 30, 2016 at 4:25 PM, Rui Ueyama <ruiu at google.com> wrote:
>
>> On Wed, Mar 30, 2016 at 4:20 PM, Sean Silva <chisophugis at gmail.com> wrote:
>>
>>> I believe the relocation stuff that Rafael is currently working on
>>> will make this a non-issue (it will make relocation application much
>>> friendlier for the CPU).
>>
>> I don't think Rafael's patch would make this a non-issue. He's making
>> scanRelocs create data, which would reduce the number of calls to the
>> virtual functions, but it wouldn't reduce them to zero.
>>
>>> However, even in the current scheme, since the target is fixed, all
>>> the indirect call sites should be monomorphic, so there shouldn't be
>>> much branch-prediction cost (certainly nothing that would cause a
>>> 1.8% performance delta for the entire link).
>>
>> Agreed. We could template the functions that call TargetInfo's member
>> functions for each target to eliminate the virtual function calls.
>
> From what has been presented, I would not conclude that virtual calls
> are actually the problem (or a problem at all). A root-cause analysis
> is necessary. As r263227 shows, the relocation application loop is very
> sensitive to small changes.
>
> One quick thing you may also want to try as a sanity check is inserting
> nops in different places in the function. I suspect you'll find that
> the performance swings (both speedups and slowdowns) from doing that
> are similar in magnitude to what you are seeing. You may also want to
> try editing the indirect call instruction into a direct call without
> otherwise modifying the binary; if that reproduces the 1.8% speedup, it
> will be convincing.

Honestly, I was somewhat skeptical about what you wrote here, but I
observed a 0.4% *slowdown* when I used GCC to compile it, so it looks
like I was wrong. It is possible that devirtualization was effective for
clang-generated code, but it is more likely that my result was a
performance deviation caused by some other factor.

The relocation handling loop really is a tight loop and is therefore
sensitive to small changes. How can we optimize it? Maybe PGO?

> If you haven't read it, I think you would enjoy this paper:
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37077.pdf
>
> -- Sean Silva
>
>>> Notice that 1.8% is smaller than the performance variation from
>>> r263227, which is a very innocuous-looking change but caused a ~2-3%
>>> slowdown for ScyllaDB (see the thread "LLD performance w.r.t. local
>>> symbols (and --build-id)").
>>>
>>> -- Sean Silva
>>>
>>> On Wed, Mar 30, 2016 at 3:39 PM, Rui Ueyama via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>>> I was wondering how much overhead the virtual function calls of the
>>>> TargetInfo member functions add. TargetInfo handles platform-specific
>>>> details, and we have target-specific subclasses of that class. The
>>>> subclasses override functions defined in TargetInfo.
>>>>
>>>> The TargetInfo member functions are called multiple times for each
>>>> relocation, so the cost of the virtual function calls may be
>>>> non-negligible. That was the motivation for the following test.
>>>>
>>>> As a test, I removed all TargetInfo subclasses except for x86-64,
>>>> moved all code from X86_64TargetInfo to TargetInfo, and removed
>>>> `virtual` from TargetInfo.
>>>>
>>>> The original LLD links itself (with debug info) in 7.499 seconds. The
>>>> de-virtualized version did the same thing in 7.364 seconds, a 1.8%
>>>> improvement.
>>>>
>>>> I'm just pointing out that there's room to improve performance here;
>>>> I'm not suggesting we do anything about it right now. We probably
>>>> shouldn't, because the current code is pretty straightforward, but
>>>> I'd expect that we will eventually want to do something about it in
>>>> the future.
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
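To illustrate the templating idea Rui mentions, here is a minimal
sketch, with hypothetical names and a simplified two-target dispatch
rather than LLD's actual classes: the machine type is dispatched on
once, and the hot loop is instantiated per concrete target type, so
every per-relocation call resolves statically and can be inlined.

#include <cstdint>
#include <vector>

struct Relocation { uint32_t Type; uint64_t Offset; uint64_t Addend; };

struct X86_64TargetInfo {
  // Non-virtual: inside the templated loop this call resolves
  // statically and can be inlined.
  void relocateOne(uint8_t *Loc, uint32_t Type, uint64_t Val) const {
    // target-specific fixup would go here
    (void)Loc; (void)Type; (void)Val;
  }
};

struct AArch64TargetInfo {
  void relocateOne(uint8_t *Loc, uint32_t Type, uint64_t Val) const {
    (void)Loc; (void)Type; (void)Val;
  }
};

// The hot loop, instantiated once per target type.
template <class TargetT>
void applyRelocations(const TargetT &Target, uint8_t *Buf,
                      const std::vector<Relocation> &Relocs) {
  for (const Relocation &R : Relocs)
    Target.relocateOne(Buf + R.Offset, R.Type, R.Addend);
}

void applyAll(uint16_t EMachine, uint8_t *Buf,
              const std::vector<Relocation> &Relocs) {
  // One dispatch per link instead of one indirect call per relocation.
  if (EMachine == 62 /* EM_X86_64 */)
    applyRelocations(X86_64TargetInfo{}, Buf, Relocs);
  else
    applyRelocations(AArch64TargetInfo{}, Buf, Relocs);
}

The cost is one copy of the loop per target, which fits the thread's
conclusion that the straightforward virtual-dispatch code is probably
fine for now.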
Sean Silva via llvm-dev
2016-Mar-31 01:42 UTC
[llvm-dev] LLD: Possible optimization for TargetInfo
On Wed, Mar 30, 2016 at 6:17 PM, Rui Ueyama <ruiu at google.com> wrote:

> [snip]
>
> Honestly, I was somewhat skeptical about what you wrote here, but I
> observed a 0.4% *slowdown* when I used GCC to compile it, so it looks
> like I was wrong. It is possible that devirtualization was effective
> for clang-generated code, but it is more likely that my result was a
> performance deviation caused by some other factor.
>
> The relocation handling loop really is a tight loop and is therefore
> sensitive to small changes. How can we optimize it? Maybe PGO?

Rafael's change will fix it. That is why he is doing it in the first
place :)

(The idea came up when we were crunching the numbers for "LLD
performance w.r.t. local symbols (and --build-id)" and looked at
r263227. I suggested that this looked like the loop getting long enough
to exhaust the CPU's reorder buffer while waiting on cache misses,
preventing the CPU from seeing the memory accesses of the next iteration
and thus failing to pipeline memory accesses across iterations. Small
changes in scheduling, instruction count, etc. will tickle this and
cause large performance swings. The solution is to make the relocation
application loop tighter, especially by separating some of the stuff
that currently sits inside one huge loop into separate loops, but also
by making the loops themselves tighter.)

-- Sean Silva
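As a minimal sketch of what that loop fission could look like, with
hypothetical structures rather than Rafael's actual patch: the scan pass
boils each relocation down to a plain record with every decision already
made, and the application pass becomes a short loop of loads and stores
that the out-of-order window can overlap across iterations.

#include <cstdint>
#include <cstring>
#include <vector>

// One compact record per relocation; all the branchy work (symbol
// lookup, addend computation, PLT/GOT decisions) was already done by
// the scan pass.
struct Fixup {
  uint64_t Offset; // where in the output buffer to patch
  uint64_t Value;  // fully computed value to write
  uint8_t Size;    // width of the patch in bytes (4 or 8)
};

// The application pass: no virtual calls, almost no branches, so the
// CPU can pipeline the cache misses of successive iterations.
void applyFixups(uint8_t *Buf, const std::vector<Fixup> &Fixups) {
  for (const Fixup &F : Fixups) {
    if (F.Size == 4) {
      uint32_t V = static_cast<uint32_t>(F.Value);
      std::memcpy(Buf + F.Offset, &V, 4);
    } else {
      std::memcpy(Buf + F.Offset, &F.Value, 8);
    }
  }
}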
Rui Ueyama via llvm-dev
2016-Mar-31 01:47 UTC
[llvm-dev] LLD: Possible optimization for TargetInfo
On Wed, Mar 30, 2016 at 6:42 PM, Sean Silva <chisophugis at gmail.com> wrote:

> [snip]
>
> Rafael's change will fix it. That is why he is doing it in the first
> place :)
>
> (... The solution is to make the relocation application loop tighter,
> especially by separating some of the stuff that currently sits inside
> one huge loop into separate loops, but also by making the loops
> themselves tighter.)

Well, as skeptical as I now am about my own result, I'm also skeptical
that this would really fix it, but maybe we should just wait and measure
once Rafael's patch is ready.

:)

-- Sean Silva