thr3ads.net - llvm dev - [LLVMdev] IC profiling infrastructure [Apr 2015]

Xinliang David Li

2015-Apr-29 00:24 UTC

[LLVMdev] IC profiling infrastructure

> From: <betulb at codeaurora.org>
> Date: Tue, Apr 7, 2015 at 12:44 PM
> Subject: [LLVMdev] IC profiling infrastructure
> To: llvmdev at cs.uiuc.edu
>
>
>
> Hi All,
>
> We had sent out an RFC in October on indirect call target profiling. The
> proposal was about profiling target addresses seen at indirect call sites.
> Using the profile data we're seeing up to 8% performance improvements on
> individual SPEC benchmarks where indirect call sites are present. We've
> already started uploading our patches to Phabricator. I'm looking
> forward to your reviews and comments on the code and am ready to respond to
> your design-related queries.
>
> A few questions posted on the RFC went unanswered. Here are the
> much-delayed comments.
>
Hi Betul, thank you for your patience.  I have completed an initial
comparison with a few alternative value profile designs. My conclusion
is that your proposed approach should work well in practice. The study can
be found here:
https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub

> 1) Added dependencies: Our implementation adds a dependency on calloc/free
> as we’re generating/maintaining a linked list at run time.
If it becomes a problem for some, there is a way to handle that, but
at the cost of more memory (to be conservative). One good feature of
using dynamic memory is that it allows counter arrays to be allocated
on the fly, which eliminates the need to allocate memory for lots of
cold/unexecuted functions.
> We also added a
> dependency on mutexes to prevent memory leaks when
> multiple threads try to insert a new target address for the same IC
> site into the linked list. To minimize the performance impact we only added
> mutexes around the pointer assignment and kept any dynamic memory
> allocation/free operations outside of the mutexed code.
Using mutexes can and should be avoided -- see the above report.
>
> 2) Indirect call data being present in sampling profile output: This is
> unfortunately not helping in our case due to perf depending on LBR
> support. To our knowledge LBR support is not present on ARM platforms.
>
Yes.
> 3) Losing profiling support on targets not supporting malloc/mutexes: The
> added dependency on calloc/free/mutexes may perhaps be eliminated
> (although our current solution does not handle this) through having a
> separate run time library for value profiling purposes. Instrumentation
> can link in two run time libraries when value profiling (an instance of it
> being indirect call target profiling) is enabled on the command line.
See above.
>
> 4) Performance of the instrumented code: Instrumentation with IC profiling
> patches resulted in 7% degradation across spec benchmarks at -O2. For the
> benchmarks that did not have any IC sites, no performance degradation was
> observed. This data is gathered using the ref data set for spec.
>
I'd like the runtime part of the change to be shared and usable
as a general-purpose value profiler (not just for indirect call
promotion), but this can be done as a follow-up.
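
For readers unfamiliar with the term, here is a minimal sketch of what indirect call promotion does once a profile shows one dominant target at a call site. It is illustrative only: the names `Handler`, `hot_handler`, `cold_handler`, `dispatch` and `dispatch_promoted` are hypothetical and do not come from the patches under review, and in practice the compiler performs this rewrite rather than the programmer.

```c
#include <assert.h>

/* Hypothetical function-pointer type for the indirect call site. */
typedef int (*Handler)(int);

/* Hypothetical callees; assume the profile says hot_handler dominates. */
int hot_handler(int x) { return x + 1; }
int cold_handler(int x) { return x * 2; }

/* Before promotion: every call goes through the pointer. */
int dispatch(Handler h, int x) {
  assert(h != 0);
  return h(x);
}

/* After promotion: compare against the profiled hot target and call it
 * directly (which also makes it a candidate for inlining); fall back to
 * the original indirect call for any other target. */
int dispatch_promoted(Handler h, int x) {
  assert(h != 0);
  if (h == hot_handler)
    return hot_handler(x);
  return h(x);
}
```

Both versions return the same result for every target; only the hot path changes from an indirect call to a guarded direct call.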

I will start with some reviews. Hopefully others will help with reviews too.

thanks,

David


> Thanks,
> -Betul Buyukkurt
>
> Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
> Foundation Collaborative Project
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

Justin Bogner

2015-Apr-29 04:31 UTC

[LLVMdev] IC profiling infrastructure

Xinliang David Li <davidxl at google.com> writes:
>> From: <betulb at codeaurora.org>
>> Date: Tue, Apr 7, 2015 at 12:44 PM
>> Subject: [LLVMdev] IC profiling infrastructure
>> To: llvmdev at cs.uiuc.edu
>>
>> Hi All,
>>
>> We had sent out an RFC in October on indirect call target profiling. The
>> proposal was about profiling target addresses seen at indirect call sites.
>> Using the profile data we're seeing up to 8% performance improvements on
>> individual SPEC benchmarks where indirect call sites are present. We've
>> already started uploading our patches to Phabricator. I'm looking
>> forward to your reviews and comments on the code and am ready to respond to
>> your design-related queries.
>>
>> A few questions posted on the RFC went unanswered. Here are the
>> much-delayed comments.
>>
>
> Hi Betul, thank you for your patience.  I have completed an initial
> comparison with a few alternative value profile designs. My conclusion
> is that your proposed approach should work well in practice. The study can
> be found here:
> https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub

Thanks for looking at this David.

Betul: I also have some thoughts on the approach and implementation of
this, but haven't had a chance to go over it in detail. I hope to have
some feedback for you on all of this sometime next week, and I'll start
reviewing the individual patches after that.

betulb at codeaurora.org

2015-Apr-29 17:19 UTC

[LLVMdev] IC profiling infrastructure

>> From: <betulb at codeaurora.org>
>> Date: Tue, Apr 7, 2015 at 12:44 PM
>> Subject: [LLVMdev] IC profiling infrastructure
>> To: llvmdev at cs.uiuc.edu
>>
>> Hi All,
>>
>> We had sent out an RFC in October on indirect call target profiling. The
>> proposal was about profiling target addresses seen at indirect call sites.
>> Using the profile data we're seeing up to 8% performance improvements on
>> individual SPEC benchmarks where indirect call sites are present. We've
>> already started uploading our patches to Phabricator. I'm looking
>> forward to your reviews and comments on the code and am ready to respond to
>> your design-related queries.
>>
>> A few questions posted on the RFC went unanswered. Here are the
>> much-delayed comments.
>>
>
> Hi Betul, thank you for your patience.  I have completed an initial
> comparison with a few alternative value profile designs. My conclusion
> is that your proposed approach should work well in practice. The study can
> be found here:
> https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub

Hi David,

Thanks for the detailed report and for working on this. We really appreciate
the feedback. We're looking forward to the comments and to upstreaming the
changes.
>
>> 1) Added dependencies: Our implementation adds a dependency on calloc/free
>> as we're generating/maintaining a linked list at run time.
>
> If it becomes a problem for some, there is a way to handle that, but
> at the cost of more memory (to be conservative). One good feature of
> using dynamic memory is that it allows counter arrays to be allocated
> on the fly, which eliminates the need to allocate memory for lots of
> cold/unexecuted functions.
>
>> We also added a
>> dependency on mutexes to prevent memory leaks when
>> multiple threads try to insert a new target address for the same IC
>> site into the linked list. To minimize the performance impact we only added
>> mutexes around the pointer assignment and kept any dynamic memory
>> allocation/free operations outside of the mutexed code.
>
> Using mutexes can and should be avoided -- see the above report.

I did read your report carefully. You suggested using an atomic update of
the linked-list link to avoid mutexes. Our runtime is written in C, so I was
not sure whether introducing C++11 features like std::atomic was OK. Also,
some operations can be performed atomically on x86 platforms (depending on
the data being aligned to various bit-length/cache-line boundaries), but ARM
or other platforms would not support that.
>>
>> 2) Indirect call data being present in sampling profile output: This is
>> unfortunately not helping in our case due to perf depending on LBR
>> support. To our knowledge LBR support is not present on ARM platforms.
>
> Yes.
>
>> 3) Losing profiling support on targets not supporting malloc/mutexes: The
>> added dependency on calloc/free/mutexes may perhaps be eliminated
>> (although our current solution does not handle this) through having a
>> separate run time library for value profiling purposes. Instrumentation
>> can link in two run time libraries when value profiling (an instance of it
>> being indirect call target profiling) is enabled on the command line.
>
> See above.
>
>> 4) Performance of the instrumented code: Instrumentation with IC profiling
>> patches resulted in 7% degradation across spec benchmarks at -O2. For the
>> benchmarks that did not have any IC sites, no performance degradation was
>> observed. This data is gathered using the ref data set for spec.
>
> I'd like the runtime part of the change to be shared and usable
> as a general-purpose value profiler (not just for indirect call
> promotion), but this can be done as a follow-up.

My understanding of your analysis was that it covered only the run-time
library performance and did not look into whether instrumentation is
actually enabled at the right sites.
> I will start with some reviews. Hopefully others will help with reviews
> too.
Thanks very much. We'll be responding to the reviews diligently.

Xinliang David Li

2015-Apr-29 17:26 UTC

[LLVMdev] IC profiling infrastructure

On Wed, Apr 29, 2015 at 10:19 AM,  <betulb at codeaurora.org> wrote:
>>> From: <betulb at codeaurora.org>
>>> Date: Tue, Apr 7, 2015 at 12:44 PM
>>> Subject: [LLVMdev] IC profiling infrastructure
>>> To: llvmdev at cs.uiuc.edu
>>>
>>> Hi All,
>>>
>>> We had sent out an RFC in October on indirect call target profiling. The
>>> proposal was about profiling target addresses seen at indirect call sites.
>>> Using the profile data we're seeing up to 8% performance improvements on
>>> individual SPEC benchmarks where indirect call sites are present. We've
>>> already started uploading our patches to Phabricator. I'm looking
>>> forward to your reviews and comments on the code and am ready to respond to
>>> your design-related queries.
>>>
>>> A few questions posted on the RFC went unanswered. Here are the
>>> much-delayed comments.
>>>
>>
>> Hi Betul, thank you for your patience.  I have completed an initial
>> comparison with a few alternative value profile designs. My conclusion
>> is that your proposed approach should work well in practice. The study can
>> be found here:
>> https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub
>
> Hi David,
>
> Thanks for the detailed report and for working on this. We really appreciate
> the feedback. We're looking forward to the comments and to upstreaming the
> changes.
>
>>
>>> 1) Added dependencies: Our implementation adds a dependency on calloc/free
>>> as we’re generating/maintaining a linked list at run time.
>>
>> If it becomes a problem for some, there is a way to handle that, but
>> at the cost of more memory (to be conservative). One good feature of
>> using dynamic memory is that it allows counter arrays to be allocated
>> on the fly, which eliminates the need to allocate memory for lots of
>> cold/unexecuted functions.
>>
>>> We also added a
>>> dependency on mutexes to prevent memory leaks when
>>> multiple threads try to insert a new target address for the same IC
>>> site into the linked list. To minimize the performance impact we only added
>>> mutexes around the pointer assignment and kept any dynamic memory
>>> allocation/free operations outside of the mutexed code.
>>
>> Using mutexes can and should be avoided -- see the above report.
>
> I did read your report carefully. You suggested use of atomic linked list
> link update to avoid mutexes. We have a runtime written in C. So I was not
> sure if introducing C++11 features like std::atomic was OK or not. Also
> some operations can be performed atomically on x86 platforms (based on
> data being aligned at various bit length/cache line boundaries) but arm or
> other platforms would not support that.
The suggestion is to use the atomic builtins -- see the review comments.
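
To make the suggestion concrete, here is a minimal sketch of a mutex-free insert in plain C using the GCC/Clang `__atomic` builtins. It is an illustration under stated assumptions, not the code from the patches or the review: the `TargetNode` layout and the names `record_target` and `count_for` are hypothetical, nodes are only ever prepended and never freed once published, and the target is assumed to provide 64-bit atomic operations.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical node: one entry per distinct target seen at a call site. */
typedef struct TargetNode {
  uint64_t target;          /* callee address; immutable once published */
  uint64_t count;           /* updated with atomic adds */
  struct TargetNode *next;  /* immutable once published */
} TargetNode;

/* Record one observation of `target` in the list rooted at *head.
 * Returns 0 on success, -1 if allocation fails. */
int record_target(TargetNode **head, uint64_t target) {
  assert(head != NULL);
  TargetNode *fresh = NULL;
  for (;;) {
    TargetNode *first = __atomic_load_n(head, __ATOMIC_ACQUIRE);
    for (TargetNode *n = first; n != NULL; n = n->next) {
      if (n->target == target) {
        __atomic_fetch_add(&n->count, 1, __ATOMIC_RELAXED);
        free(fresh);  /* another thread published this target first */
        return 0;
      }
    }
    if (fresh == NULL) {
      /* Allocation happens outside any critical section. */
      fresh = calloc(1, sizeof(*fresh));
      if (fresh == NULL)
        return -1;
      fresh->target = target;
      fresh->count = 1;
    }
    fresh->next = first;
    /* Publish the node only if the head is still the one we scanned. */
    if (__atomic_compare_exchange_n(head, &first, fresh, 0,
                                    __ATOMIC_RELEASE, __ATOMIC_RELAXED))
      return 0;
    /* Lost the race: rescan, since the winner may have added our target. */
  }
}

/* Helper for inspection: the count recorded for `target`, or 0 if absent. */
uint64_t count_for(TargetNode *head, uint64_t target) {
  for (TargetNode *n = head; n != NULL; n = n->next)
    if (n->target == target)
      return __atomic_load_n(&n->count, __ATOMIC_RELAXED);
  return 0;
}
```

Because a node that loses the compare-and-swap is either retried or freed by the thread that allocated it, no node leaks when two threads race to insert the same target, which is the problem the mutex was guarding against.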
>
>>>
>>> 2) Indirect call data being present in sampling profile output: This is
>>> unfortunately not helping in our case due to perf depending on LBR
>>> support. To our knowledge LBR support is not present on ARM platforms.
>>
>> Yes.
>>
>>> 3) Losing profiling support on targets not supporting malloc/mutexes: The
>>> added dependency on calloc/free/mutexes may perhaps be eliminated
>>> (although our current solution does not handle this) through having a
>>> separate run time library for value profiling purposes. Instrumentation
>>> can link in two run time libraries when value profiling (an instance of it
>>> being indirect call target profiling) is enabled on the command line.
>>
>> See above.
>>
>>> 4) Performance of the instrumented code: Instrumentation with IC profiling
>>> patches resulted in 7% degradation across spec benchmarks at -O2. For the
>>> benchmarks that did not have any IC sites, no performance degradation was
>>> observed. This data is gathered using the ref data set for spec.
>>
>> I'd like the runtime part of the change to be shared and usable
>> as a general-purpose value profiler (not just for indirect call
>> promotion), but this can be done as a follow-up.
>
> My understanding of your analysis was that it only covered the run-time
> library performance and not really looked into if instrumentation is
> really enabled at the right sites.
It was mainly focusing on the runtime library performance.
>
>> I will start with some reviews. Hopefully others will help with reviews
>> too.
I looked through one patch and sent the comments.

David


Betul Buyukkurt

2015-May-13 17:49 UTC

[LLVMdev] IC profiling infrastructure

> Xinliang David Li <davidxl at google.com> writes:
>>> From: <betulb at codeaurora.org>
>>> Date: Tue, Apr 7, 2015 at 12:44 PM
>>> Subject: [LLVMdev] IC profiling infrastructure
>>> To: llvmdev at cs.uiuc.edu
>>>
>>> Hi All,
>>>
>>> We had sent out an RFC in October on indirect call target profiling. The
>>> proposal was about profiling target addresses seen at indirect call sites.
>>> Using the profile data we're seeing up to 8% performance improvements on
>>> individual SPEC benchmarks where indirect call sites are present. We've
>>> already started uploading our patches to Phabricator. I'm looking
>>> forward to your reviews and comments on the code and am ready to respond to
>>> your design-related queries.
>>>
>>> A few questions posted on the RFC went unanswered. Here are the
>>> much-delayed comments.
>>>
>>
>> Hi Betul, thank you for your patience.  I have completed an initial
>> comparison with a few alternative value profile designs. My conclusion
>> is that your proposed approach should work well in practice. The study can
>> be found here:
>> https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub
>
> Thanks for looking at this David.
>
> Betul: I also have some thoughts on the approach and implementation of
> this, but haven't had a chance to go over it in detail. I hope to have
> some feedback for you on all of this sometime next week, and I'll start
> reviewing the individual patches after that.
Hi All,

I posted three more patches yesterday. They might be missing some
cosmetic fixes, but support for profiling multiple value kinds has
been added to the readers, writers and runtime. I'd appreciate your
comments on the CLs.

Thanks,
-Betul
