Hi Betul, I've finally gotten around to going over this in detail - sorry for the delay, and thanks for working on this. I think that the general approach is a good one, and that this will end up working well. I have a couple of points on a high level, and I'll also send some review for the patches you've sent out. - The raw profile format does not need to be backwards compatible, that's only important for the indexed one. Instead of adding RawInstrValueProfReader, just change RawInstrProfReader and reject the data if the version is 1. Similarly, don't change the raw profile magic - just bump the version to 2. - We don't need to store the value profiling kind in the data at all. The frontend knows which invocations of the intrinsic are for each kind implicitly, much like it knows the difference between a counter for an "if" and a "for" apart implicitly. Just store one set of profiling data and let the frontend sort it out. - Given that, the instrprof libraries needn't know about kinds at all, that can be dealt with locally in clang's CodeGenPGO, where we need to know what to do with the data. I'd just drop the kind completely for now. It'll be easy to add later without affecting much. - We should be able to statically allocate the first entry for each callsite for __llvm_profile_value_data. It will always be there, and for call sites that always call the same target it's nice to avoid the allocation. - The changes in the llvm and compiler-rt repos need their own tests. - Please use clang-format on your changes. Several of the patches have unusual or non-llvm-style whitespace/wrapping/etc. Finally, the LLVM patch is quite large. We can probably make it a bit easier to stage and review by splitting out a couple of things and doing them first: 1. Changing the __llvm_profile_data variables so they're no longer const. 2. Adding the InstrProfRecord type and updating the unittests. Also, this type should just go in the InstrProf.h header, no need for its own. 3. Arguably, we could update the raw format and implement the RawInstrProfReader changes (but ignoring the new data) first. If you think this is worthwhile go ahead, but if it'll lead to too much extra busywork it probably isn't worth it. I'm sending some review on the patches themselves as well, but I expect those to mostly be on the mechanical aspects since the high level points are already written down here. Xinliang David Li <davidxl at google.com> writes:> I have sent my review comments. I think most of my high level concerns > have been addressed (except for last few minor fix ups). > > Justin, do you have a chance to take a look? > > thanks, > > David > > On Wed, May 13, 2015 at 10:49 AM, Betul Buyukkurt <betulb at codeaurora.org> wrote: >>> Xinliang David Li <davidxl at google.com> writes: >>>>> From: <betulb at codeaurora.org> >>>>> Date: Tue, Apr 7, 2015 at 12:44 PM >>>>> Subject: [LLVMdev] IC profiling infrastructure >>>>> To: llvmdev at cs.uiuc.edu >>>>> >>>>> >>>>> >>>>> Hi All, >>>>> >>>>> We had sent out an RFC in October on indirect call target profiling. >>>>> The >>>>> proposal was about profiling target addresses seen at indirect call >>>>> sites. >>>>> Using the profile data we're seeing up to %8 performance improvements >>>>> on >>>>> individual spec benchmarks where indirect call sites are present. We've >>>>> already started uploading our patches to the phabricator. I'm looking >>>>> forward to your reviews and comments on the code and ready to respond >>>>> to >>>>> your design related queries. >>>>> >>>>> There were few questions posted on the RFC that were not responded. >>>>> Here >>>>> are the much delayed comments. >>>>> >>>> >>>> Hi Betul, thank you for your patience. I have completed initial >>>> comparison with a few alternative value profile designs. My conclusion >>>> is that your proposed approach should well in practice. The study can >>>> be found here: >>>> https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub >>> >>> Thanks for looking at this David. >>> >>> Betul: I also have some thoughts on the approach and implementation of >>> this, but haven't had a chance to go over it in detail. I hope to have >>> some feedback for you on all of this sometime next week, and I'll start >>> reviewing the individual patches after that. >> >> Hi All, >> >> I've posted three more patches yesterday. They might be missing some >> cosmetic fixes, but the support for profiling multiple value kinds have >> been added to the readers, writers and runtime. I'd appreciate your >> comments on the CL's. >> >> Thanks, >> -Betul >> >>> >>>>> 1) Added dependencies: Our implementation adds dependency on >>>>> calloc/free >>>>> as we’re generating/maintaining a linked list at run time. >>>> >>>> If it becomes a problem for some, there is a way to handle that -- but >>>> at a cost of more memory required (to be conservative). One of the >>>> good feature of using dynamic memory is that it allows counter array >>>> allocation on the fly which eliminates the need to allocate memory for >>>> lots of cold/unexecuted functions. >>>> >>>>> We also added >>>>> dependency on the usage of mutexes to prevent memory leaks in the case >>>>> multiple threads trying to insert a new target address for the same IC >>>>> site into the linked list. To least impact the performance we only >>>>> added >>>>> mutexes around the pointer assignment and kept any dynamic memory >>>>> allocation/free operations outside of the mutexed code. >>>> >>>> This (using mutexes) should be and can be avoided -- see the above >>>> report. >>>> >>>>> >>>>> 2) Indirect call data being present in sampling profile output: This is >>>>> unfortunately not helping in our case due to perf depending on lbr >>>>> support. To our knowledge lbr support is not present on ARM platforms. >>>>> >>>> >>>> yes. >>>> >>>>> 3) Losing profiling support on targets not supporting malloc/mutexes: >>>>> The >>>>> added dependency on calloc/free/mutexes may perhaps be eliminated >>>>> (although our current solution does not handle this) through having a >>>>> separate run time library for value profiling purposes. Instrumentation >>>>> can link in two run time libraries when value profiling (an instance of >>>>> it >>>>> being indirect call target profiling) is enabled on the command line. >>>> >>>> See above. >>>> >>>>> >>>>> 4) Performance of the instrumented code: Instrumentation with IC >>>>> profiling >>>>> patches resulted in 7% degradation across spec benchmarks at -O2. For >>>>> the >>>>> benchmarks that did not have any IC sites, no performance degradation >>>>> was >>>>> observed. This data is gathered using the ref data set for spec. >>>>> >>>> >>>> I'd like to make the runtime part of the change to be shared and used >>>> as a general purpose value profiler (not just indirect call >>>> promotion), but this can be done as a follow up. >>>> >>>> I will start with some reviews. Hopefully others will help with reviews >>>> too. >>>> >>>> thanks, >>>> >>>> David >>>> >>>> >>>> >>>>> Thanks, >>>>> -Betul Buyukkurt >>>>> >>>>> Qualcomm Innovation Center, Inc. >>>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a >>>>> Linux >>>>> Foundation Collaborative Project >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> LLVM Developers mailing list >>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >>>>> >>>> >>>> _______________________________________________ >>>> LLVM Developers mailing list >>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >>> >> >>
On Thu, May 21, 2015 at 4:10 PM, Justin Bogner <mail at justinbogner.com> wrote:> Hi Betul, > > I've finally gotten around to going over this in detail - sorry for the > delay, and thanks for working on this. > > I think that the general approach is a good one, and that this will end > up working well. I have a couple of points on a high level, and I'll > also send some review for the patches you've sent out. > > - The raw profile format does not need to be backwards compatible, > that's only important for the indexed one. Instead of adding > RawInstrValueProfReader, just change RawInstrProfReader and reject the > data if the version is 1. Similarly, don't change the raw profile > magic - just bump the version to 2.I second this.> > - We don't need to store the value profiling kind in the data at all. > The frontend knows which invocations of the intrinsic are for each kind > implicitly, much like it knows the difference between a counter for an > "if" and a "for" apart implicitly. Just store one set of profiling data > and let the frontend sort it out.I was going to suggest this and there is a better way: Note that: const IntPtrT Values[instr_value_prof_kind::last]; // NULL, used at runtime This field can actually be used to point to the start of the value entries for each kind. On the other hand, storing value kind does not waste any space -- as it is a 32bit integer (paired with counterIdx, another 32 bit integer).> > - Given that, the instrprof libraries needn't know about kinds at all, > that can be dealt with locally in clang's CodeGenPGO, where we need to > know what to do with the data. I'd just drop the kind completely for > now. It'll be easy to add later without affecting much.This is doable, but at the cost of some flexibility in the future. For instance for different value_profile_kind, the in-disk format may be different -- e.g, for some cases we only care about storing 'average value' per-site etc to reduce storage size.> > - We should be able to statically allocate the first entry for each > callsite for __llvm_profile_value_data. It will always be there, and > for call sites that always call the same target it's nice to avoid the > allocation.Since we need dynamic allocation any way, I don't see much benefit of doing static allocation here. In my experience of large apps, a big percentage (say >50%) of functions are really cold in the training run which will never be exercised/called. Dynamic allocation has the advantage of minimizing memory consumption. thanks, David> > - The changes in the llvm and compiler-rt repos need their own tests. > > - Please use clang-format on your changes. Several of the patches have > unusual or non-llvm-style whitespace/wrapping/etc. > > Finally, the LLVM patch is quite large. We can probably make it a bit > easier to stage and review by splitting out a couple of things and doing > them first: > > 1. Changing the __llvm_profile_data variables so they're no longer > const. > > 2. Adding the InstrProfRecord type and updating the unittests. Also, > this type should just go in the InstrProf.h header, no need for its > own. > > 3. Arguably, we could update the raw format and implement the > RawInstrProfReader changes (but ignoring the new data) first. If you > think this is worthwhile go ahead, but if it'll lead to too much > extra busywork it probably isn't worth it. > > I'm sending some review on the patches themselves as well, but I expect > those to mostly be on the mechanical aspects since the high level points > are already written down here. > > Xinliang David Li <davidxl at google.com> writes: >> I have sent my review comments. I think most of my high level concerns >> have been addressed (except for last few minor fix ups). >> >> Justin, do you have a chance to take a look? >> >> thanks, >> >> David >> >> On Wed, May 13, 2015 at 10:49 AM, Betul Buyukkurt <betulb at codeaurora.org> wrote: >>>> Xinliang David Li <davidxl at google.com> writes: >>>>>> From: <betulb at codeaurora.org> >>>>>> Date: Tue, Apr 7, 2015 at 12:44 PM >>>>>> Subject: [LLVMdev] IC profiling infrastructure >>>>>> To: llvmdev at cs.uiuc.edu >>>>>> >>>>>> >>>>>> >>>>>> Hi All, >>>>>> >>>>>> We had sent out an RFC in October on indirect call target profiling. >>>>>> The >>>>>> proposal was about profiling target addresses seen at indirect call >>>>>> sites. >>>>>> Using the profile data we're seeing up to %8 performance improvements >>>>>> on >>>>>> individual spec benchmarks where indirect call sites are present. We've >>>>>> already started uploading our patches to the phabricator. I'm looking >>>>>> forward to your reviews and comments on the code and ready to respond >>>>>> to >>>>>> your design related queries. >>>>>> >>>>>> There were few questions posted on the RFC that were not responded. >>>>>> Here >>>>>> are the much delayed comments. >>>>>> >>>>> >>>>> Hi Betul, thank you for your patience. I have completed initial >>>>> comparison with a few alternative value profile designs. My conclusion >>>>> is that your proposed approach should well in practice. The study can >>>>> be found here: >>>>> https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub >>>> >>>> Thanks for looking at this David. >>>> >>>> Betul: I also have some thoughts on the approach and implementation of >>>> this, but haven't had a chance to go over it in detail. I hope to have >>>> some feedback for you on all of this sometime next week, and I'll start >>>> reviewing the individual patches after that. >>> >>> Hi All, >>> >>> I've posted three more patches yesterday. They might be missing some >>> cosmetic fixes, but the support for profiling multiple value kinds have >>> been added to the readers, writers and runtime. I'd appreciate your >>> comments on the CL's. >>> >>> Thanks, >>> -Betul >>> >>>> >>>>>> 1) Added dependencies: Our implementation adds dependency on >>>>>> calloc/free >>>>>> as we’re generating/maintaining a linked list at run time. >>>>> >>>>> If it becomes a problem for some, there is a way to handle that -- but >>>>> at a cost of more memory required (to be conservative). One of the >>>>> good feature of using dynamic memory is that it allows counter array >>>>> allocation on the fly which eliminates the need to allocate memory for >>>>> lots of cold/unexecuted functions. >>>>> >>>>>> We also added >>>>>> dependency on the usage of mutexes to prevent memory leaks in the case >>>>>> multiple threads trying to insert a new target address for the same IC >>>>>> site into the linked list. To least impact the performance we only >>>>>> added >>>>>> mutexes around the pointer assignment and kept any dynamic memory >>>>>> allocation/free operations outside of the mutexed code. >>>>> >>>>> This (using mutexes) should be and can be avoided -- see the above >>>>> report. >>>>> >>>>>> >>>>>> 2) Indirect call data being present in sampling profile output: This is >>>>>> unfortunately not helping in our case due to perf depending on lbr >>>>>> support. To our knowledge lbr support is not present on ARM platforms. >>>>>> >>>>> >>>>> yes. >>>>> >>>>>> 3) Losing profiling support on targets not supporting malloc/mutexes: >>>>>> The >>>>>> added dependency on calloc/free/mutexes may perhaps be eliminated >>>>>> (although our current solution does not handle this) through having a >>>>>> separate run time library for value profiling purposes. Instrumentation >>>>>> can link in two run time libraries when value profiling (an instance of >>>>>> it >>>>>> being indirect call target profiling) is enabled on the command line. >>>>> >>>>> See above. >>>>> >>>>>> >>>>>> 4) Performance of the instrumented code: Instrumentation with IC >>>>>> profiling >>>>>> patches resulted in 7% degradation across spec benchmarks at -O2. For >>>>>> the >>>>>> benchmarks that did not have any IC sites, no performance degradation >>>>>> was >>>>>> observed. This data is gathered using the ref data set for spec. >>>>>> >>>>> >>>>> I'd like to make the runtime part of the change to be shared and used >>>>> as a general purpose value profiler (not just indirect call >>>>> promotion), but this can be done as a follow up. >>>>> >>>>> I will start with some reviews. Hopefully others will help with reviews >>>>> too. >>>>> >>>>> thanks, >>>>> >>>>> David >>>>> >>>>> >>>>> >>>>>> Thanks, >>>>>> -Betul Buyukkurt >>>>>> >>>>>> Qualcomm Innovation Center, Inc. >>>>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a >>>>>> Linux >>>>>> Foundation Collaborative Project >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> LLVM Developers mailing list >>>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >>>>>> >>>>> >>>>> _______________________________________________ >>>>> LLVM Developers mailing list >>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >>>> >>> >>>
Xinliang David Li <davidxl at google.com> writes:> On Thu, May 21, 2015 at 4:10 PM, Justin Bogner <mail at justinbogner.com> wrote: >> Hi Betul, >> >> I've finally gotten around to going over this in detail - sorry for the >> delay, and thanks for working on this. >> >> I think that the general approach is a good one, and that this will end >> up working well. I have a couple of points on a high level, and I'll >> also send some review for the patches you've sent out. >> >> - The raw profile format does not need to be backwards compatible, >> that's only important for the indexed one. Instead of adding >> RawInstrValueProfReader, just change RawInstrProfReader and reject the >> data if the version is 1. Similarly, don't change the raw profile >> magic - just bump the version to 2. > > I second this. > >> >> - We don't need to store the value profiling kind in the data at all. >> The frontend knows which invocations of the intrinsic are for each kind >> implicitly, much like it knows the difference between a counter for an >> "if" and a "for" apart implicitly. Just store one set of profiling data >> and let the frontend sort it out. > > I was going to suggest this and there is a better way: > > Note that: > > const IntPtrT Values[instr_value_prof_kind::last]; // NULL, used at runtime > > This field can actually be used to point to the start of the value > entries for each kind.This array of pointers adds complexity all over each of the patches in review. If it isn't necessary I'd much rather avoid it. If we do keep the kind, I'll have further comments on how to reduce that complexity (other than in the raw format), but I really don't think having the kind here is worthwhile in its current form.> On the other hand, storing value kind does not waste any space -- as > it is a 32bit integer (paired with counterIdx, another 32 bit > integer). > >> >> - Given that, the instrprof libraries needn't know about kinds at all, >> that can be dealt with locally in clang's CodeGenPGO, where we need to >> know what to do with the data. I'd just drop the kind completely for >> now. It'll be easy to add later without affecting much. > > This is doable, but at the cost of some flexibility in the future. For > instance for different value_profile_kind, the in-disk format may be > different -- e.g, for some cases we only care about storing 'average > value' per-site etc to reduce storage size.As implemented, all of the profile kinds need to be identical - to support kinds that have a different format we need to change the formats anyway. Adding the complexity now when we'll have to change it later *anyway* isn't worth it.>> >> - We should be able to statically allocate the first entry for each >> callsite for __llvm_profile_value_data. It will always be there, and >> for call sites that always call the same target it's nice to avoid the >> allocation. > > Since we need dynamic allocation any way, I don't see much benefit of > doing static allocation here. In my experience of large apps, a big > percentage (say >50%) of functions are really cold in the training run > which will never be exercised/called. Dynamic allocation has the > advantage of minimizing memory consumption.If >50% of the "functions that contain indirect calls" are never called, then I agree that the dynamic allocation is probably a better trade off. Note that this is functions that contain these calls, not "call sites", given the way we're dynamically allocating here. I don't have any data, but that seems superficially unlikely to me - I would expect training runs to have somewhat reasonable code coverage to be useful. OTOH, statically allocating one node assumes that indirect call sites that resolve to exactly one destination are reasonably common. Since allocating several times isn't much more expensive than allocating once, there isn't much benefit in avoiding the first allocation if the "exactly one destination" case is unlikely. The trade off isn't exactly clear cut, but elsewhere in the raw profile format we've gone the "low execution time overhead" direction, which I think is the right choice unless the memory cost is particularly bad.> thanks, > > David > > >> >> - The changes in the llvm and compiler-rt repos need their own tests. >> >> - Please use clang-format on your changes. Several of the patches have >> unusual or non-llvm-style whitespace/wrapping/etc. >> >> Finally, the LLVM patch is quite large. We can probably make it a bit >> easier to stage and review by splitting out a couple of things and doing >> them first: >> >> 1. Changing the __llvm_profile_data variables so they're no longer >> const. >> >> 2. Adding the InstrProfRecord type and updating the unittests. Also, >> this type should just go in the InstrProf.h header, no need for its >> own. >> >> 3. Arguably, we could update the raw format and implement the >> RawInstrProfReader changes (but ignoring the new data) first. If you >> think this is worthwhile go ahead, but if it'll lead to too much >> extra busywork it probably isn't worth it. >> >> I'm sending some review on the patches themselves as well, but I expect >> those to mostly be on the mechanical aspects since the high level points >> are already written down here. >> >> Xinliang David Li <davidxl at google.com> writes: >>> I have sent my review comments. I think most of my high level concerns >>> have been addressed (except for last few minor fix ups). >>> >>> Justin, do you have a chance to take a look? >>> >>> thanks, >>> >>> David >>> >>> On Wed, May 13, 2015 at 10:49 AM, Betul Buyukkurt >>> <betulb at codeaurora.org> wrote: >>>>> Xinliang David Li <davidxl at google.com> writes: >>>>>>> From: <betulb at codeaurora.org> >>>>>>> Date: Tue, Apr 7, 2015 at 12:44 PM >>>>>>> Subject: [LLVMdev] IC profiling infrastructure >>>>>>> To: llvmdev at cs.uiuc.edu >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi All, >>>>>>> >>>>>>> We had sent out an RFC in October on indirect call target profiling. >>>>>>> The >>>>>>> proposal was about profiling target addresses seen at indirect call >>>>>>> sites. >>>>>>> Using the profile data we're seeing up to %8 performance improvements >>>>>>> on >>>>>>> individual spec benchmarks where indirect call sites are present. We've >>>>>>> already started uploading our patches to the phabricator. I'm looking >>>>>>> forward to your reviews and comments on the code and ready to respond >>>>>>> to >>>>>>> your design related queries. >>>>>>> >>>>>>> There were few questions posted on the RFC that were not responded. >>>>>>> Here >>>>>>> are the much delayed comments. >>>>>>> >>>>>> >>>>>> Hi Betul, thank you for your patience. I have completed initial >>>>>> comparison with a few alternative value profile designs. My conclusion >>>>>> is that your proposed approach should well in practice. The study can >>>>>> be found here: >>>>>> https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub >>>>> >>>>> Thanks for looking at this David. >>>>> >>>>> Betul: I also have some thoughts on the approach and implementation of >>>>> this, but haven't had a chance to go over it in detail. I hope to have >>>>> some feedback for you on all of this sometime next week, and I'll start >>>>> reviewing the individual patches after that. >>>> >>>> Hi All, >>>> >>>> I've posted three more patches yesterday. They might be missing some >>>> cosmetic fixes, but the support for profiling multiple value kinds have >>>> been added to the readers, writers and runtime. I'd appreciate your >>>> comments on the CL's. >>>> >>>> Thanks, >>>> -Betul >>>> >>>>> >>>>>>> 1) Added dependencies: Our implementation adds dependency on >>>>>>> calloc/free >>>>>>> as we’re generating/maintaining a linked list at run time. >>>>>> >>>>>> If it becomes a problem for some, there is a way to handle that -- but >>>>>> at a cost of more memory required (to be conservative). One of the >>>>>> good feature of using dynamic memory is that it allows counter array >>>>>> allocation on the fly which eliminates the need to allocate memory for >>>>>> lots of cold/unexecuted functions. >>>>>> >>>>>>> We also added >>>>>>> dependency on the usage of mutexes to prevent memory leaks in the case >>>>>>> multiple threads trying to insert a new target address for the same IC >>>>>>> site into the linked list. To least impact the performance we only >>>>>>> added >>>>>>> mutexes around the pointer assignment and kept any dynamic memory >>>>>>> allocation/free operations outside of the mutexed code. >>>>>> >>>>>> This (using mutexes) should be and can be avoided -- see the above >>>>>> report. >>>>>> >>>>>>> >>>>>>> 2) Indirect call data being present in sampling profile output: This is >>>>>>> unfortunately not helping in our case due to perf depending on lbr >>>>>>> support. To our knowledge lbr support is not present on ARM platforms. >>>>>>> >>>>>> >>>>>> yes. >>>>>> >>>>>>> 3) Losing profiling support on targets not supporting malloc/mutexes: >>>>>>> The >>>>>>> added dependency on calloc/free/mutexes may perhaps be eliminated >>>>>>> (although our current solution does not handle this) through having a >>>>>>> separate run time library for value profiling purposes. Instrumentation >>>>>>> can link in two run time libraries when value profiling (an instance of >>>>>>> it >>>>>>> being indirect call target profiling) is enabled on the command line. >>>>>> >>>>>> See above. >>>>>> >>>>>>> >>>>>>> 4) Performance of the instrumented code: Instrumentation with IC >>>>>>> profiling >>>>>>> patches resulted in 7% degradation across spec benchmarks at -O2. For >>>>>>> the >>>>>>> benchmarks that did not have any IC sites, no performance degradation >>>>>>> was >>>>>>> observed. This data is gathered using the ref data set for spec. >>>>>>> >>>>>> >>>>>> I'd like to make the runtime part of the change to be shared and used >>>>>> as a general purpose value profiler (not just indirect call >>>>>> promotion), but this can be done as a follow up. >>>>>> >>>>>> I will start with some reviews. Hopefully others will help with reviews >>>>>> too. >>>>>> >>>>>> thanks, >>>>>> >>>>>> David >>>>>> >>>>>> >>>>>> >>>>>>> Thanks, >>>>>>> -Betul Buyukkurt >>>>>>> >>>>>>> Qualcomm Innovation Center, Inc. >>>>>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a >>>>>>> Linux >>>>>>> Foundation Collaborative Project >>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> LLVM Developers mailing list >>>>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> LLVM Developers mailing list >>>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >>>>> >>>> >>>>
> Hi Betul, > > I've finally gotten around to going over this in detail - sorry for the > delay, and thanks for working on this. > > I think that the general approach is a good one, and that this will end > up working well. I have a couple of points on a high level, and I'll > also send some review for the patches you've sent out. > > - The raw profile format does not need to be backwards compatible, > that's only important for the indexed one. Instead of adding > RawInstrValueProfReader, just change RawInstrProfReader and reject the > data if the version is 1. Similarly, don't change the raw profile > magic - just bump the version to 2.Thanks. Not remaining backwards compatible in the RawInstrProfReader removed much of the added complexity from the implementation. I'll upload my changes soon to the phabricator later today.> - We don't need to store the value profiling kind in the data at all. > The frontend knows which invocations of the intrinsic are for each kind > implicitly, much like it knows the difference between a counter for an > "if" and a "for" apart implicitly. Just store one set of profiling data > and let the frontend sort it out.I think so too. However, value_kind should stay around until the kinds of expressions whose values are computed are included in the hash computation. Otherwise, if a value-profiled expression is to be removed from source, then all the rest of the values would get attached to the MD fields of wrong types. Currently hash checking verifies if the counter assigned regions in the source has been changed across profile consumption runs.> - Given that, the instrprof libraries needn't know about kinds at all, > that can be dealt with locally in clang's CodeGenPGO, where we need to > know what to do with the data. I'd just drop the kind completely for > now. It'll be easy to add later without affecting much.I've not reflected this request in my latest changes that I'll upload later today. If we can come up with a mechanism to make sure the source changes on values profiled can be detected at function level, then I may bring in this change in my CLs.> - We should be able to statically allocate the first entry for each > callsite for __llvm_profile_value_data. It will always be there, and > for call sites that always call the same target it's nice to avoid the > allocation.I'm not in favor of this as dynamic allocation seem to be doing fine.> - The changes in the llvm and compiler-rt repos need their own tests.I'll add them in.> - Please use clang-format on your changes. Several of the patches have > unusual or non-llvm-style whitespace/wrapping/etc.I'll do this as well.> Finally, the LLVM patch is quite large. We can probably make it a bit > easier to stage and review by splitting out a couple of things and doing > them first: > > 1. Changing the __llvm_profile_data variables so they're no longer > const. > > 2. Adding the InstrProfRecord type and updating the unittests. Also, > this type should just go in the InstrProf.h header, no need for its > own. > > 3. Arguably, we could update the raw format and implement the > RawInstrProfReader changes (but ignoring the new data) first. If you > think this is worthwhile go ahead, but if it'll lead to too much > extra busywork it probably isn't worth it. > > I'm sending some review on the patches themselves as well, but I expect > those to mostly be on the mechanical aspects since the high level points > are already written down here.I'll first update current CL's responding to the review comments and then start uploading the CL's in the above recommended order. -Betul> Xinliang David Li <davidxl at google.com> writes: >> I have sent my review comments. I think most of my high level concerns >> have been addressed (except for last few minor fix ups). >> >> Justin, do you have a chance to take a look? >> >> thanks, >> >> David >> >> On Wed, May 13, 2015 at 10:49 AM, Betul Buyukkurt >> <betulb at codeaurora.org> wrote: >>>> Xinliang David Li <davidxl at google.com> writes: >>>>>> From: <betulb at codeaurora.org> >>>>>> Date: Tue, Apr 7, 2015 at 12:44 PM >>>>>> Subject: [LLVMdev] IC profiling infrastructure >>>>>> To: llvmdev at cs.uiuc.edu >>>>>> >>>>>> >>>>>> >>>>>> Hi All, >>>>>> >>>>>> We had sent out an RFC in October on indirect call target profiling. >>>>>> The >>>>>> proposal was about profiling target addresses seen at indirect call >>>>>> sites. >>>>>> Using the profile data we're seeing up to %8 performance >>>>>> improvements >>>>>> on >>>>>> individual spec benchmarks where indirect call sites are present. >>>>>> We've >>>>>> already started uploading our patches to the phabricator. I'm >>>>>> looking >>>>>> forward to your reviews and comments on the code and ready to >>>>>> respond >>>>>> to >>>>>> your design related queries. >>>>>> >>>>>> There were few questions posted on the RFC that were not responded. >>>>>> Here >>>>>> are the much delayed comments. >>>>>> >>>>> >>>>> Hi Betul, thank you for your patience. I have completed initial >>>>> comparison with a few alternative value profile designs. My >>>>> conclusion >>>>> is that your proposed approach should well in practice. The study can >>>>> be found here: >>>>> https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub >>>> >>>> Thanks for looking at this David. >>>> >>>> Betul: I also have some thoughts on the approach and implementation of >>>> this, but haven't had a chance to go over it in detail. I hope to have >>>> some feedback for you on all of this sometime next week, and I'll >>>> start >>>> reviewing the individual patches after that. >>> >>> Hi All, >>> >>> I've posted three more patches yesterday. They might be missing some >>> cosmetic fixes, but the support for profiling multiple value kinds have >>> been added to the readers, writers and runtime. I'd appreciate your >>> comments on the CL's. >>> >>> Thanks, >>> -Betul >>> >>>> >>>>>> 1) Added dependencies: Our implementation adds dependency on >>>>>> calloc/free >>>>>> as weâre generating/maintaining a linked list at run time. >>>>> >>>>> If it becomes a problem for some, there is a way to handle that -- >>>>> but >>>>> at a cost of more memory required (to be conservative). One of the >>>>> good feature of using dynamic memory is that it allows counter array >>>>> allocation on the fly which eliminates the need to allocate memory >>>>> for >>>>> lots of cold/unexecuted functions. >>>>> >>>>>> We also added >>>>>> dependency on the usage of mutexes to prevent memory leaks in the >>>>>> case >>>>>> multiple threads trying to insert a new target address for the same >>>>>> IC >>>>>> site into the linked list. To least impact the performance we only >>>>>> added >>>>>> mutexes around the pointer assignment and kept any dynamic memory >>>>>> allocation/free operations outside of the mutexed code. >>>>> >>>>> This (using mutexes) should be and can be avoided -- see the above >>>>> report. >>>>> >>>>>> >>>>>> 2) Indirect call data being present in sampling profile output: This >>>>>> is >>>>>> unfortunately not helping in our case due to perf depending on lbr >>>>>> support. To our knowledge lbr support is not present on ARM >>>>>> platforms. >>>>>> >>>>> >>>>> yes. >>>>> >>>>>> 3) Losing profiling support on targets not supporting >>>>>> malloc/mutexes: >>>>>> The >>>>>> added dependency on calloc/free/mutexes may perhaps be eliminated >>>>>> (although our current solution does not handle this) through having >>>>>> a >>>>>> separate run time library for value profiling purposes. >>>>>> Instrumentation >>>>>> can link in two run time libraries when value profiling (an instance >>>>>> of >>>>>> it >>>>>> being indirect call target profiling) is enabled on the command >>>>>> line. >>>>> >>>>> See above. >>>>> >>>>>> >>>>>> 4) Performance of the instrumented code: Instrumentation with IC >>>>>> profiling >>>>>> patches resulted in 7% degradation across spec benchmarks at -O2. >>>>>> For >>>>>> the >>>>>> benchmarks that did not have any IC sites, no performance >>>>>> degradation >>>>>> was >>>>>> observed. This data is gathered using the ref data set for spec. >>>>>> >>>>> >>>>> I'd like to make the runtime part of the change to be shared and used >>>>> as a general purpose value profiler (not just indirect call >>>>> promotion), but this can be done as a follow up. >>>>> >>>>> I will start with some reviews. Hopefully others will help with >>>>> reviews >>>>> too. >>>>> >>>>> thanks, >>>>> >>>>> David >>>>> >>>>> >>>>> >>>>>> Thanks, >>>>>> -Betul Buyukkurt >>>>>> >>>>>> Qualcomm Innovation Center, Inc. >>>>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a >>>>>> Linux >>>>>> Foundation Collaborative Project >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> LLVM Developers mailing list >>>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >>>>>> >>>>> >>>>> _______________________________________________ >>>>> LLVM Developers mailing list >>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >>>> >>> >>> >
On Mon, Jun 1, 2015 at 9:22 AM, Betul Buyukkurt <betulb at codeaurora.org> wrote:> > Hi Betul, > > > > I've finally gotten around to going over this in detail - sorry for the > > delay, and thanks for working on this. > > > > I think that the general approach is a good one, and that this will end > > up working well. I have a couple of points on a high level, and I'll > > also send some review for the patches you've sent out. > > > > - The raw profile format does not need to be backwards compatible, > > that's only important for the indexed one. Instead of adding > > RawInstrValueProfReader, just change RawInstrProfReader and reject the > > data if the version is 1. Similarly, don't change the raw profile > > magic - just bump the version to 2. > > Thanks. Not remaining backwards compatible in the RawInstrProfReader > removed much of the added complexity from the implementation. I'll upload > my changes soon to the phabricator later today. > > > - We don't need to store the value profiling kind in the data at all. > > The frontend knows which invocations of the intrinsic are for each kind > > implicitly, much like it knows the difference between a counter for an > > "if" and a "for" apart implicitly. Just store one set of profiling data > > and let the frontend sort it out. > > I think so too. However, value_kind should stay around until the kinds of > expressions whose values are computed are included in the hash > computation. Otherwise, if a value-profiled expression is to be removed > from source, then all the rest of the values would get attached to the MD > fields of wrong types. Currently hash checking verifies if the counter > assigned regions in the source has been changed across profile consumption > runs. >yes -- separating counter arrays from different value kind is still very useful, for instance profile-generate and profile-use compilations can turn on different set of value profile options and can still work. However, by organizing the in-memory and persistent profile data structure properly, we don't need to record the value kind, nor counter index information in the value profile entries. See my most recent review comments. David> > > - Given that, the instrprof libraries needn't know about kinds at all, > > that can be dealt with locally in clang's CodeGenPGO, where we need to > > know what to do with the data. I'd just drop the kind completely for > > now. It'll be easy to add later without affecting much. > > I've not reflected this request in my latest changes that I'll upload > later today. If we can come up with a mechanism to make sure the source > changes on values profiled can be detected at function level, then I may > bring in this change in my CLs. > > > - We should be able to statically allocate the first entry for each > > callsite for __llvm_profile_value_data. It will always be there, and > > for call sites that always call the same target it's nice to avoid the > > allocation. > > I'm not in favor of this as dynamic allocation seem to be doing fine. > > > - The changes in the llvm and compiler-rt repos need their own tests. > > I'll add them in. > > > - Please use clang-format on your changes. Several of the patches have > > unusual or non-llvm-style whitespace/wrapping/etc. > > I'll do this as well. > > > Finally, the LLVM patch is quite large. We can probably make it a bit > > easier to stage and review by splitting out a couple of things and doing > > them first: > > > > 1. Changing the __llvm_profile_data variables so they're no longer > > const. > > > > 2. Adding the InstrProfRecord type and updating the unittests. Also, > > this type should just go in the InstrProf.h header, no need for its > > own. > > > > 3. Arguably, we could update the raw format and implement the > > RawInstrProfReader changes (but ignoring the new data) first. If you > > think this is worthwhile go ahead, but if it'll lead to too much > > extra busywork it probably isn't worth it. > > > > I'm sending some review on the patches themselves as well, but I expect > > those to mostly be on the mechanical aspects since the high level points > > are already written down here. > > I'll first update current CL's responding to the review comments and then > start uploading the CL's in the above recommended order. > > -Betul > > > Xinliang David Li <davidxl at google.com> writes: > >> I have sent my review comments. I think most of my high level concerns > >> have been addressed (except for last few minor fix ups). > >> > >> Justin, do you have a chance to take a look? > >> > >> thanks, > >> > >> David > >> > >> On Wed, May 13, 2015 at 10:49 AM, Betul Buyukkurt > >> <betulb at codeaurora.org> wrote: > >>>> Xinliang David Li <davidxl at google.com> writes: > >>>>>> From: <betulb at codeaurora.org> > >>>>>> Date: Tue, Apr 7, 2015 at 12:44 PM > >>>>>> Subject: [LLVMdev] IC profiling infrastructure > >>>>>> To: llvmdev at cs.uiuc.edu > >>>>>> > >>>>>> > >>>>>> > >>>>>> Hi All, > >>>>>> > >>>>>> We had sent out an RFC in October on indirect call target profiling. > >>>>>> The > >>>>>> proposal was about profiling target addresses seen at indirect call > >>>>>> sites. > >>>>>> Using the profile data we're seeing up to %8 performance > >>>>>> improvements > >>>>>> on > >>>>>> individual spec benchmarks where indirect call sites are present. > >>>>>> We've > >>>>>> already started uploading our patches to the phabricator. I'm > >>>>>> looking > >>>>>> forward to your reviews and comments on the code and ready to > >>>>>> respond > >>>>>> to > >>>>>> your design related queries. > >>>>>> > >>>>>> There were few questions posted on the RFC that were not responded. > >>>>>> Here > >>>>>> are the much delayed comments. > >>>>>> > >>>>> > >>>>> Hi Betul, thank you for your patience. I have completed initial > >>>>> comparison with a few alternative value profile designs. My > >>>>> conclusion > >>>>> is that your proposed approach should well in practice. The study can > >>>>> be found here: > >>>>> > https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub > >>>> > >>>> Thanks for looking at this David. > >>>> > >>>> Betul: I also have some thoughts on the approach and implementation of > >>>> this, but haven't had a chance to go over it in detail. I hope to have > >>>> some feedback for you on all of this sometime next week, and I'll > >>>> start > >>>> reviewing the individual patches after that. > >>> > >>> Hi All, > >>> > >>> I've posted three more patches yesterday. They might be missing some > >>> cosmetic fixes, but the support for profiling multiple value kinds have > >>> been added to the readers, writers and runtime. I'd appreciate your > >>> comments on the CL's. > >>> > >>> Thanks, > >>> -Betul > >>> > >>>> > >>>>>> 1) Added dependencies: Our implementation adds dependency on > >>>>>> calloc/free > >>>>>> as we’re generating/maintaining a linked list at run time. > >>>>> > >>>>> If it becomes a problem for some, there is a way to handle that -- > >>>>> but > >>>>> at a cost of more memory required (to be conservative). One of the > >>>>> good feature of using dynamic memory is that it allows counter array > >>>>> allocation on the fly which eliminates the need to allocate memory > >>>>> for > >>>>> lots of cold/unexecuted functions. > >>>>> > >>>>>> We also added > >>>>>> dependency on the usage of mutexes to prevent memory leaks in the > >>>>>> case > >>>>>> multiple threads trying to insert a new target address for the same > >>>>>> IC > >>>>>> site into the linked list. To least impact the performance we only > >>>>>> added > >>>>>> mutexes around the pointer assignment and kept any dynamic memory > >>>>>> allocation/free operations outside of the mutexed code. > >>>>> > >>>>> This (using mutexes) should be and can be avoided -- see the above > >>>>> report. > >>>>> > >>>>>> > >>>>>> 2) Indirect call data being present in sampling profile output: This > >>>>>> is > >>>>>> unfortunately not helping in our case due to perf depending on lbr > >>>>>> support. To our knowledge lbr support is not present on ARM > >>>>>> platforms. > >>>>>> > >>>>> > >>>>> yes. > >>>>> > >>>>>> 3) Losing profiling support on targets not supporting > >>>>>> malloc/mutexes: > >>>>>> The > >>>>>> added dependency on calloc/free/mutexes may perhaps be eliminated > >>>>>> (although our current solution does not handle this) through having > >>>>>> a > >>>>>> separate run time library for value profiling purposes. > >>>>>> Instrumentation > >>>>>> can link in two run time libraries when value profiling (an instance > >>>>>> of > >>>>>> it > >>>>>> being indirect call target profiling) is enabled on the command > >>>>>> line. > >>>>> > >>>>> See above. > >>>>> > >>>>>> > >>>>>> 4) Performance of the instrumented code: Instrumentation with IC > >>>>>> profiling > >>>>>> patches resulted in 7% degradation across spec benchmarks at -O2. > >>>>>> For > >>>>>> the > >>>>>> benchmarks that did not have any IC sites, no performance > >>>>>> degradation > >>>>>> was > >>>>>> observed. This data is gathered using the ref data set for spec. > >>>>>> > >>>>> > >>>>> I'd like to make the runtime part of the change to be shared and used > >>>>> as a general purpose value profiler (not just indirect call > >>>>> promotion), but this can be done as a follow up. > >>>>> > >>>>> I will start with some reviews. Hopefully others will help with > >>>>> reviews > >>>>> too. > >>>>> > >>>>> thanks, > >>>>> > >>>>> David > >>>>> > >>>>> > >>>>> > >>>>>> Thanks, > >>>>>> -Betul Buyukkurt > >>>>>> > >>>>>> Qualcomm Innovation Center, Inc. > >>>>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a > >>>>>> Linux > >>>>>> Foundation Collaborative Project > >>>>>> > >>>>>> > >>>>>> > >>>>>> _______________________________________________ > >>>>>> LLVM Developers mailing list > >>>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > >>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > >>>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> LLVM Developers mailing list > >>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > >>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > >>>> > >>> > >>> > > > > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150601/f0c4a928/attachment.html>
"Betul Buyukkurt" <betulb at codeaurora.org> writes:>> - We don't need to store the value profiling kind in the data at all. >> The frontend knows which invocations of the intrinsic are for each kind >> implicitly, much like it knows the difference between a counter for an >> "if" and a "for" apart implicitly. Just store one set of profiling data >> and let the frontend sort it out. > > I think so too. However, value_kind should stay around until the kinds of > expressions whose values are computed are included in the hash > computation. Otherwise, if a value-profiled expression is to be removed > from source, then all the rest of the values would get attached to the MD > fields of wrong types. Currently hash checking verifies if the counter > assigned regions in the source has been changed across profile consumption > runs.This isn't a good reason to keep value_kind - the function hash not reflecting these kinds of changes is a problem regardless of whether or not we have multiple types. We should just add hash computations for these indirect call counters. As long as we only add this when indirect profiling is turned on it won't invalidate old hashes so it's trivial to stay backwards compatible. The other thing we should do is store which profiling options are enabled, in both formats. When we specify profile-instr-use it'll be less error prone if we can detect whether or not this includes indirect call profiling without checking other options, and it's probably a good idea for "llvm-profdata merge" to disallow merging profiles that are gathering different sets of data. A bitfield seems suitable for this.