Xinliang David Li via llvm-dev
2015-Oct-08 06:12 UTC
[llvm-dev] RFC: Reducing Instr PGO size overhead
There is no further response to this, so I will assume the general direction of solution-3 is acceptable ;)

Solution-3 can be further improved. Instead of using a static symbol table (with zero-size __llvm_prf_nm symbols) to store function names for profile display and coverage mapping, the function names can be stored compressed in a non-allocatable section. The compression ratio for function name strings can be very high (~5x). The covmapping data can also be made non-allocatable.

thanks,

David

On Tue, Sep 29, 2015 at 9:52 AM, Xinliang David Li <davidxl at google.com> wrote:
> Sorry for the late update. I finally found time to implement a solution
> (Solution-3) that has the best size savings (for both PGO and coverage
> testing) with symbolic information available. Here is a recap of what we
> have discussed so far:
>
> Solution-1:
>
> This is the original solution proposed. In this solution, a function name's
> MD5 hash is used as the primary key (combined with the IR hash) for function
> lookup. The __llvm_prf_names section won't be emitted into the binary nor
> dumped into the profile data unless -fcoverage-mapping is also specified.
>
> Pros:
>     1. maximal reduction of the instrumented binary's process image size
>     2. maximal reduction of object and unstripped binary size
>     3. maximal reduction of raw profile data size
>     4. maximal reduction of indexed profile data size
>
> Cons:
>     1. the -fcoverage-mapping use case is not handled -- the size problem
>        still exists
>     2. profile dumps from llvm-profdata no longer have function names
>        associated -- the user needs a postprocessing tool to get that functionality
>     3. function filtering with a partial function name is not supported
>     4. requires an incompatible format change
>
> Solution-2: (http://reviews.llvm.org/D12715)
>
> In this solution, the MD5 hash string is used to replace the raw name string.
>
> Pros:
>     1. very simple to implement
>     2. good reduction of all sizes for typical large C++ applications
>     3. no profile data format change is required
>
> Cons:
>     1. still requires a 16-byte per-function overhead -- this can actually
>        hurt C programs
>     2. the -fcoverage-mapping use case is still not handled
>     3. the problem with llvm-profdata still exists (no symbolic info,
>        partial filtering support)
>
> Solution-3:
>
> This is the new solution I am proposing. It is basically an enhancement of
> Solution-1 with most of the weaknesses resolved. The differences from
> Solution-1 are:
>     1. Function name symbols are emitted into the symbol table as weak
>        externs. They don't occupy any space at runtime and can be easily stripped.
>     2. -fcoverage-mapping does not need special handling -- it
>        automatically benefits from the same size saving.
>     3. llvm-cov is changed to read symbol info from the symtab instead of
>        reading it from the name section data.
>     4. llvm-profdata is enhanced to take a binary as input and dump the
>        profile with names attached. Function filtering is fully supported (an
>        option can also be introduced to force dumping names into the binary and
>        profile data, so that the llvm-profdata use case is not changed at all).
>
> Pros:
>     1. all the pros from Solution-1
>     2. size savings for the coverage-mapping case
> Cons:
>     a format change is required for the profile data and coverage mapping
>
> The initial patch is here: http://reviews.llvm.org/D13251
>
> With this patch, the size of a release clang binary with coverage mapping is
> reduced from 986M to 569M.
>
> If there are no major concerns, I will carve out the patch into smaller
> ones for review.
> thanks,
>
> David
>
> On Tue, Sep 8, 2015 at 3:47 PM, Xinliang David Li <davidxl at google.com> wrote:
>>
>> >> >> yes -- it is a fixed-length (8-byte) blob which may include a null byte
>> >> >> in the middle.
>> >> >
>> >> > For reference, an MD5 sum is 16 bytes (128-bit):
>> >> > https://en.wikipedia.org/wiki/MD5
>> >>
>> >> yes, LLVM's MD5 hash only takes the lower 64 bits.
>> >>
>> >> >> >
>> >> >> > Or to say it another way, suppose that Itanium mangling required as a
>> >> >> > final step to replace the string with its md5 sum in hex. Therefore all
>> >> >> > symbol names are "small". My understanding is that this is effectively
>> >> >> > all your patch is doing.
>> >> >>
>> >> >> The key type before the change is StringRef, while after the change the
>> >> >> key type is uint64_t. Are you suggesting treating the uint64_t md5 sum key
>> >> >> as a string of 8 bytes, or storing the md5 hash in text form, which would
>> >> >> double the size?
>> >> >
>> >> > How much does this change the benefit? If most of the benefit is avoiding
>> >> > extraordinarily long mangled names then it may be sufficient.
>> >> >
>> >> > With IR-level instrumentation like Rong is pursuing, the size may be
>> >> > reduced sufficiently that we do not need the optimization proposed in this
>> >> > thread. For example, Rong found >2x size reduction on Google's C++
>> >> > benchmarks, which I assume are representative of the extremely large
>> >> > Google binaries that are causing the problems addressed by your proposal
>> >> > in this thread. The measurements you mention for Clang in this thread
>> >> > provide similar size reductions, so Rong's approach may be sufficient
>> >> > (especially because functions with extremely large mangled names tend to
>> >> > be small inline functions in header-only template libraries).
>> >>
>> >> Late instrumentation helps many cases. In some cases (as shown in
>> >> SPEC), the reduction in size is not as large. Reducing PGO overhead
>> >> will lower the bar for its adoption.
>> >>
>> >> > Of the points you mention in "Large size of overhead can limit the
>> >> > usability of PGO greatly", many of the issues are hard limits that prevent
>> >> > the use of PGO. Do you have a lower bound on how much the size of the PGO
>> >> > data must be reduced in order to overcome the hard limits?
>> >>
>> >> This is a static view: think about the situation where application
>> >> size is ever increasing; also think about the situation where we want to
>> >> collect more types of profile data. Think about the situation where a user
>> >> wants to run PGO binaries on small devices with tiny memory/storage.
>> >
>> > If we want to reduce memory overhead at runtime and reduce the size of the
>> > raw profile data extracted from the target, there are clear solutions.
>> > Consider that debug info does not need to be loaded into the memory image of
>> > the target; why should information identifying each counter need to be?
>> > A file containing raw profile counters is a subset of a core dump; in most
>> > environments, a core dump does not need to have debug info or symbol names
>> > in it, but can still be read in full detail in conjunction with the
>> > original binary.
>>
>> Yes -- there are many alternatives:
>> 1) emit the name-to-key mapping as side data at compile time, or
>> 2) emit it into non-loadable sections of the object file.
>>
>> Compared with the above, LLVM's existing design does have its own
>> advantage -- making it easier for a tool to access 'debug' info for
>> counters.
>>
>> LLVM's coverage testing, on the other hand, takes a hybrid approach: it
>> emits the coverage map as rodata, but does not pass it to the profile
>> dumper. I think it is better to emit the covmap as side data not
>> attached to the target binary.
>>
>> > Thus, as we require that the binary be passed to llvm-profdata, there is no
>> > fundamental reason that the memory image of the program, or the raw data
>> > extracted from the program, must have any size overhead besides the raw
>> > values of the counters themselves and any text size increase for
>> > incrementing them. If we are willing to impose this requirement on users,
>> > then as far as reducing memory overhead at runtime and reducing the size of
>> > the raw profile data extracted from the target, using hashed function names
>> > is clearly the wrong direction.
>> >
>> > *Without* imposing the requirement of passing the binary to llvm-profdata, I
>> > do like the ability to use hashed function names like you are proposing. It
>> > is a simple solution for reducing the size overhead of function name strings
>> > with little complexity, as it is just swapping one string for another.
>>
>> Agree. The good news is that the overhead of hashed function names is
>> small enough to make this approach attractive.
>>
>> thanks,
>>
>> David
>>
>> >> > Obviously LLVM must be able to support the extremely large binaries in
>> >> > your configuration (otherwise what use is LLVM as a compiler ;) My
>> >> > questions are primarily aimed at establishing which tradeoffs are
>> >> > acceptable for supporting this (both for LLVM and for you guys).
>> >>
>> >> As I said, with the modified proposal (after getting your feedback),
>> >> no PGO user in LLVM land is going to lose any functionality. The
>> >> end result will be a net win for general users of LLVM (even though your
>> >> customers don't care about it), not just 'us' as you have mentioned
>> >> many times.
>> >>
>> >> > Btw, for us, the issue of PGO data size is not completely immaterial but
>> >> > is very different from your use case. For us, the primary issue is the
>> >> > additional memory use at run time, since PS4 games usually use "all"
>> >> > available memory. We had a problem with UBSan where the large amount of
>> >> > memory required for storing the UBSan diagnostic data at runtime required
>> >> > the game programmers to manually change their memory map to make room.
>> >> > +Filipe, do you remember how much memory UBSan was using that caused a
>> >> > problem?
>> >>
>> >> My proposal does help reduce rodata size significantly.
>> >
>> > Yes, that is why I think that this is a useful thing to do. I just want to
>> > be careful about existing use cases and the relevant workflow issues.
>> >
>> > -- Sean Silva
>>
>> >> >> In the raw format, the md5 sum key can be an embedded field in the
>> >> >> prf_data variable instead of a different variable referenced by
>> >> >> prf_data.
>> >> >>
>> >> >> > If this is not the case, you should show your current patch so that we
>> >> >> > can discuss things concretely.
>> >> >>
>> >> >> It is not. See above about the difference.
>> >> >>
>> >> >> >> >> It will be very messy to support multiple formats in instr-codegen
>> >> >> >> >> and instr-runtime. For compatibility concerns, the reader is taught
>> >> >> >> >> to support the previous format, but the changes there are isolated
>> >> >> >> >> (also expected to be removed in the future).
>> >> >> >> >>
>> >> >> >> >> > My primary concern is that if the function names are not kept at
>> >> >> >> >> > all stages, then it becomes difficult to analyze the profile data
>> >> >> >> >> > in a standalone way. Many times, I have used `llvm-profdata show
>> >> >> >> >> > -all-functions foo.profdata` on the resulting profile data and
>> >> >> >> >> > then imported that data into Mathematica for analysis.
>> >> >> >> >>
>> >> >> >> >> This is certainly a very valid use case.
>> >> >> >> >>
>> >> >> >> >> > My understanding of your proposal is that `llvm-profdata show
>> >> >> >> >> > -all-functions foo.profdata` will not show the actual function
>> >> >> >> >> > names but instead MD5 hashes,
>> >> >> >> >>
>> >> >> >> >> Yes.
>> >> >> >> >>
>> >> >> >> >> To support your use case, there are two solutions:
>> >> >> >> >>
>> >> >> >> >> 1) the user can add the -fcoverage-mapping option to the build
>> >> >> >> >> 2) introduce a new option, -fprofile-instr-names, that forces the
>> >> >> >> >> emission of the name sections in the .o file. This is similar to 1),
>> >> >> >> >> but no covmap section is needed.
>> >> >> >> >>
>> >> >> >> >> The llvm-profdata tool will be taught to read the name section and
>> >> >> >> >> attach function names to the profile records.
>> >> >> >> >
>> >> >> >> > Needing to pass the executable to llvm-profdata would cause
>> >> >> >> > deployment issues for my customers in practice.
>> >> >> >>
>> >> >> >> Why? The deployment needs to pass the profile data anyway, right?
>> >> >> >
>> >> >> > Yes, but not the executable.
>> >> >> >
>> >> >> > The PGO training run is likely being run by a gameplay tester
>> >> >> > (non-programmer). In general the binary will not be lying around as a
>> >> >> > loose file anywhere; it will be part of a full package of the
>> >> >> > binary+assets (think like what will end up on a Blu-ray disc). A game's
>> >> >> > binary is *completely useless* without the assets, so except locally on
>> >> >> > a programmer's machine while they iterate/debug, there is no reason for
>> >> >> > a binary to ever exist as a standalone file.
>> >> >> >
>> >> >> > I'm not saying that needing the binary is insurmountable in any
>> >> >> > particular scenario. Just that it will cause a strict increase in the
>> >> >> > number of issues to deploying PGO.
>> >> >>
>> >> >> Your concern is acknowledged.
>> >> >>
>> >> >> > These are much bigger "compatibility concerns" for me than for newer
>> >> >> > toolchains to accept the old format. For a change in format I can
>> >> >> > easily tell my users to replace an exe with a newer one and that is all
>> >> >> > they need to do and it takes 10 seconds, guaranteed. A workflow change
>> >> >> > is potentially a massive disruption and guaranteed to take more than 10
>> >> >> > seconds to fix (perhaps hours or days).
>> >> >>
>> >> >> ok.
>> >> >>
>> >> >> >> This is no different from the llvm-cov usage model.
>> >> >> >
>> >> >> > In practice, getting the performance of PGO is a higher priority for my
>> >> >> > users, so we should not assume that llvm-cov is being used.
>> >> >>
>> >> >> Glad to hear that :)
>> >> >>
>> >> >> thanks,
>> >> >>
>> >> >> David
>> >> >>
>> >> >> > -- Sean Silva
>> >> >>
>> >> >> >> >> Note that with 1) or 2), the user can still benefit from the
>> >> >> >> >> reduced profile size.
>> >> >> >> >
>> >> >> >> > Let me reiterate that the size of the profile is not a problem I
>> >> >> >> > have observed in practice (nor have I heard of this being a problem
>> >> >> >> > in practice until this thread). Therefore I'm skeptical of any
>> >> >> >> > changes to our default behavior or any new requirements that are not
>> >> >> >> > opt-in.
>> >> >> >> >
>> >> >> >> > -- Sean Silva
>> >> >> >> >>
>> >> >> >> >> thanks,
>> >> >> >> >>
>> >> >> >> >> David
>> >> >> >> >>
>> >> >> >> >> > which will make it more difficult for me to do this kind of
>> >> >> >> >> > analysis (would require using nm on the original binary, hashing
>> >> >> >> >> > everything, etc.).
>> >> >> >> >> >
>> >> >> >> >> > btw, feel free to attach the patch even if it is in a rough state.
>> >> >> >> >> > It can still help to clarify the proposal and be a good talking
>> >> >> >> >> > point. Fine-grained patch review for caring about the rough parts
>> >> >> >> >> > will happen on llvm-commits; the rough parts will not distract the
>> >> >> >> >> > discussion here on llvm-dev.
>> >> >> >> >> >
>> >> >> >> >> > -- Sean Silva
>> >> >> >> >> >>
>> >> >> >> >> >> thanks,
>> >> >> >> >> >>
>> >> >> >> >> >> David
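Solutions 1 and 3 above key each profile record by a 64-bit hash of the function name, and the quoted discussion notes that LLVM's MD5-based hash keeps only the lower 64 bits of the digest. The following is a minimal sketch of how such a key can be derived; it is not the in-tree implementation, the helper name is made up, and the exact MD5Result accessors have changed across LLVM versions, so plain byte indexing is used.

// Sketch only: derive a 64-bit lookup key from a mangled function name by
// taking the low-order 8 bytes of its MD5 digest.
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/MD5.h"
#include <cstdint>

static uint64_t hashFuncName(llvm::StringRef MangledName) {
  llvm::MD5 Hash;
  Hash.update(MangledName);
  llvm::MD5::MD5Result Digest;
  Hash.final(Digest);
  // Assemble the first 8 digest bytes, little-endian, into the key.
  uint64_t Key = 0;
  for (unsigned I = 0; I < 8; ++I)
    Key |= static_cast<uint64_t>(Digest[I]) << (8 * I);
  return Key;
}

Rendering such a 64-bit key as hexadecimal yields a fixed 16-character string, which is presumably the fixed 16-byte per-function name overhead cited for Solution-2 above.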
Sean Silva via llvm-dev
2015-Oct-09 22:58 UTC
[llvm-dev] RFC: Reducing Instr PGO size overhead
On Wed, Oct 7, 2015 at 11:12 PM, Xinliang David Li <davidxl at google.com> wrote:

> There is no further response to this, so I will assume the general
> direction of solution-3 is acceptable ;)

No response does not mean "LGTM".

-- Sean Silva

> Solution-3 can be further improved. Instead of using a static symbol
> table (with zero-size __llvm_prf_nm symbols) to store function names
> for profile display and coverage mapping, the function names can be
> stored compressed in a non-allocatable section. The compression ratio
> for function name strings can be very high (~5x). The covmapping data
> can also be made non-allocatable.
> [...]
Xinliang David Li via llvm-dev
2015-Oct-09 23:06 UTC
[llvm-dev] RFC: Reducing Instr PGO size overhead
On Fri, Oct 9, 2015 at 3:58 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
> On Wed, Oct 7, 2015 at 11:12 PM, Xinliang David Li <davidxl at google.com> wrote:
>>
>> There is no further response to this, so I will assume the general
>> direction of solution-3 is acceptable ;)
>
> No response does not mean "LGTM".

What I meant is that the discussion can be moved on to the formal code
review. I have not submitted the final patch for review yet. Before that
is ready, continue using this thread to voice your concerns.

David

> -- Sean Silva
> [...]
Bob Wilson via llvm-dev
2015-Dec-14 16:55 UTC
[llvm-dev] RFC: Reducing Instr PGO size overhead
> On Dec 9, 2015, at 1:12 PM, Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> We are now very close to pushing the final stage of the PGO size
> reduction task. Here is the updated plan going forward:
>
> 1) In this round, the format of the indexed profile data won't be unchanged.

That's a double negative. Was that intentional? I.e., are you changing the format?

> 2) there will be *no* changes in user interfaces to any of the profiling
> related tools, including llvm-profdata and llvm-cov -- the change will be
> transparent in terms of PGO usability.
> 3) The implementation will use compression for the function name
> section (the compression ratio is about 5:1). As a result, the linker
> input object size, unstripped binary size, stripped binary size,
> process image size, and raw profile data size will all be greatly
> reduced.
> 4) The change will also greatly improve the usability of coverage mapping
> due to the reduced data size in memory.
>
> Before the final patch, there are a few more small preparation patches:
> 1) abstract name reading into a class (ProfSymtab) (currently the reader
> uses/assumes the raw/uncompressed object section); 2) add compression
> support in ProfSymtab.
>
> thanks,
>
> David
>
> On Wed, Oct 14, 2015 at 12:30 AM, Xinliang David Li <davidxl at google.com> wrote:
>> I plan to divide the patch into a series of small patches. The
>> preparation patches will mostly be refactoring changes with NFC. This
>> will minimize the size of the final patch(es) with functional changes to
>> help discussions.
>>
>> thanks,
>>
>> David
>> [...]
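Both the Oct 8 message at the top of this thread and the Dec 9 plan quoted above describe storing the function name strings compressed (roughly 5:1) in a non-allocatable section. Below is a minimal, self-contained sketch of that idea using the raw zlib API; the sample names and the '\1' separator are illustrative assumptions only, not what the actual ProfSymtab patch emits.

// Standalone sketch: concatenate function name strings and compress them,
// as a non-allocatable name section of the kind described above might.
#include <cstdio>
#include <string>
#include <vector>
#include <zlib.h>

int main() {
  // Hypothetical mangled names; real input would be every instrumented
  // function name in the module.
  const char *Names[] = {"_ZN4llvm9StringRef5splitEc",
                         "_ZN4llvm3MD56updateENS_9StringRefE",
                         "main"};
  std::string Concat;
  for (const char *N : Names) {
    Concat += N;
    Concat += '\1'; // separator between names (illustrative choice)
  }

  uLongf CompressedLen = compressBound(Concat.size());
  std::vector<Bytef> Compressed(CompressedLen);
  int RC = compress2(Compressed.data(), &CompressedLen,
                     reinterpret_cast<const Bytef *>(Concat.data()),
                     Concat.size(), Z_BEST_COMPRESSION);
  if (RC != Z_OK)
    return 1;

  std::printf("uncompressed: %zu bytes, compressed: %lu bytes (%.1fx)\n",
              Concat.size(), (unsigned long)CompressedLen,
              (double)Concat.size() / CompressedLen);
  return 0;
}

A toy input like this will not reach the quoted ratio; the ~5:1 figure applies to the large sets of long, highly repetitive mangled C++ names found in real binaries.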
maximal reduction of instrumentation binary process image >>>>>> size >>>>>> 2. maximal reduction of object and unstripped binary size >>>>>> 3. maximal reduction of raw profile data size >>>>>> 4. maximal reduction of indexed profile data size >>>>>> >>>>>> Cons: >>>>>> 1. -fcoverage-mapping use case is not handled -- the size >>>>>> problem >>>>>> still exist >>>>>> 2. profile dump with llvm-profdata no longer have function names >>>>>> associated -- user needs to use postprocessing tool to get the >>>>>> functionality >>>>>> 3. function filtering with partial function name is not >>>>>> supported >>>>>> 4. Requires incompatible format change >>>>>> >>>>>> >>>>>> Solution-2: (http://reviews.llvm.org/D12715) >>>>>> >>>>>> In this solution, the MD5 hash string is used to replace the raw name >>>>>> string >>>>>> Pros: >>>>>> 1. Very simple to implement >>>>>> 2. have good reduction of all sizes for typical large C++ >>>>>> applications >>>>>> 3. No profile data format change is required. >>>>>> >>>>>> Cons: >>>>>> 1. Still requires 16byte overhead per-function -- can actually >>>>>> hurt C programs >>>>>> 2. -fcoverage-mapping use case is still not handled >>>>>> 3. The problem with llvm-profdata still exists (no symbolic >>>>>> info, >>>>>> partial filtering support) >>>>>> >>>>>> >>>>>> Solution-3: >>>>>> >>>>>> This is the new solution I am proposing. It is basically an enhancement >>>>>> of >>>>>> Solution-1 with most of the weakness resolved. The difference with >>>>>> Solution-1 is >>>>>> 1. Function name symbols are emitted into the symbol table as weak >>>>>> externs. They don't occupy any space at runtime and can be easily >>>>>> stripped. >>>>>> 2. -fcoverage-mapping does not need special handling -- it >>>>>> automatically benefit from the same size saving. >>>>>> 3. llvm-cov is changed to read symbol info from the symtab instead >>>>>> of >>>>>> reading them from the name section data >>>>>> 4. llvm-profdata is enhanced to take a binary as input and dump >>>>>> profile with names attached. Function filtering is fully supported >>>>>> (option >>>>>> can also be introduced to force dumping names into binary and profile >>>>>> data, >>>>>> so that llvm-profdata use case is not changed at all). >>>>>> >>>>>> Pros: >>>>>> 1. All the pros from Solution-1 >>>>>> 2. Size savings for coverage-mapping case >>>>>> Cons: >>>>>> Format change is required for profile data and coverage mapping. >>>>>> >>>>>> The initial patch is here: http://reviews.llvm.org/D13251 >>>>>> >>>>>> With this patch, the size of a release clang binary with coverage >>>>>> mapping is >>>>>> reduced from 986M to 569M. >>>>>> >>>>>> If there are no major concerns, I will carve out the patch into smaller >>>>>> ones for review. >>>>>> >>>>>> thanks, >>>>>> >>>>>> David >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Sep 8, 2015 at 3:47 PM, Xinliang David Li <davidxl at google.com> >>>>>> wrote: >>>>>>> >>>>>>>>>>> >>>>>>>>>>> yes -- it is fixed length (8byte) blob which may include null >>>>>>>>>>> byte >>>>>>>>>>> in >>>>>>>>>>> the middle. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> For reference, MD5 sum is 16 bytes (128-bit): >>>>>>>>>> https://en.wikipedia.org/wiki/MD5 >>>>>>>>> >>>>>>>>> yes, LLVM's MD5 hash only takes the lower 64bit. >>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Or to say it another way, suppose that Itanium mangling >>>>>>>>>>>> required >>>>>>>>>>>> as a >>>>>>>>>>>> final >>>>>>>>>>>> step to replace the string with its md5 sum in hex. 
Therefore >>>>>>>>>>>> all >>>>>>>>>>>> symbol >>>>>>>>>>>> names are "small". My understanding is that this is effectively >>>>>>>>>>>> all >>>>>>>>>>>> your >>>>>>>>>>>> patch is doing. >>>>>>>>>>> >>>>>>>>>>> The key type before the change is StringRef, while the after the >>>>>>>>>>> key >>>>>>>>>>> type is uint64_t. Are you suggesting treating uint64_t md5 sum >>>>>>>>>>> key >>>>>>>>>>> as >>>>>>>>>>> a string of 8 bytes or storing md5 has in text form which will >>>>>>>>>>> double >>>>>>>>>>> the size? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> How much does this change the benefit? If most of the benefit is >>>>>>>>>> avoiding >>>>>>>>>> extraordinarily long mangled names then it may be sufficient. >>>>>>>>>> >>>>>>>>>> With IR-level instrumentation like Rong is pursuing the size may >>>>>>>>>> be >>>>>>>>>> reduced >>>>>>>>>> sufficiently that we do not need the optimization proposed in this >>>>>>>>>> thread. >>>>>>>>>> For example, Rong found >2x size reduction on Google's C++ >>>>>>>>>> benchmarks, >>>>>>>>>> which >>>>>>>>>> I assume are representative of the extremely large Google binaries >>>>>>>>>> that >>>>>>>>>> are >>>>>>>>>> causing the problems addressed by your proposal in this thread. >>>>>>>>>> The >>>>>>>>>> measurements you mention for Clang in this thread provide similar >>>>>>>>>> size >>>>>>>>>> reductions, so Rong's approach may be sufficient (especially >>>>>>>>>> because >>>>>>>>>> functions with extremely large mangled names tend to be small >>>>>>>>>> inline >>>>>>>>>> functions in header-only template libraries). >>>>>>>>> >>>>>>>>> Late instrumentation helps many cases. In some cases (as shown in >>>>>>>>> SPEC), the reduction in size is not as large. Reducing PGO overhead >>>>>>>>> will lower the bar for its adoption. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Of the points you mention in "Large size of overhead can limit the >>>>>>>>>> usability >>>>>>>>>> of PGO greatly", many of the issues are hard limits that prevent >>>>>>>>>> the >>>>>>>>>> use >>>>>>>>>> of >>>>>>>>>> PGO. Do you have a lower bound on how much the size of the PGO >>>>>>>>>> data >>>>>>>>>> must >>>>>>>>>> be >>>>>>>>>> reduced in order to overcome the hard limits? >>>>>>>>> >>>>>>>>> This is a static view: Think about the situation where application >>>>>>>>> size is ever increasing; also think about situation where we want to >>>>>>>>> >>>>>>>>> collect more types of profile data. Think about situation where user >>>>>>>>> want to run pgo binaries on small devices with tiny memory/storage >>>>>>>>> .. >>>>>>>> >>>>>>>> >>>>>>>> If we want to reduce memory overhead at runtime and reduce the size >>>>>>>> of >>>>>>>> the >>>>>>>> raw profile data extracted from the target, there are clear >>>>>>>> solutions. >>>>>>>> Consider that debug info does not need to be loaded into the memory >>>>>>>> image of >>>>>>>> the target; why should information identifying each counter need to >>>>>>>> be? >>>>>>>> A file containing raw profile counters is a subset of a core dump; in >>>>>>>> most >>>>>>>> environments, a core dump does not need to have debug info or symbol >>>>>>>> names >>>>>>>> in it, but can be still be read in full detail in conjunction with >>>>>>>> the >>>>>>>> original binary. >>>>>>> >>>>>>> Yes -- there are many alternatives: >>>>>>> 1) emit the name key mapping as a side data at compile time, or >>>>>>> 2) emit them into nonloadable sections of the object file. 
>>>>>>> >>>>>>> Compared with the above, LLVM's existing design does have its own >>>>>>> advantage -- making it easier for tool to access 'debug' info for >>>>>>> counters. >>>>>>> >>>>>>> LLVM's coverage testing, on the other hand, take a hybrid approach: It >>>>>>> emits the coverage map as rodata, but does not pass it to the profile >>>>>>> dumper. I think it is better to emit covmap as a side data not >>>>>>> attached to target binary. >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> Thus, as we require that the binary be passed to llvm-profdata, there >>>>>>>> is >>>>>>>> no >>>>>>>> fundamental reason that the memory image of the program, or the raw >>>>>>>> data >>>>>>>> extracted from the program, must have any size overhead besides the >>>>>>>> raw >>>>>>>> values of the counters themselves and any text size increase for >>>>>>>> incrementing them. If we are willing to impose this requirement on >>>>>>>> users, >>>>>>>> then as far as reducing memory overhead at runtime and reducing the >>>>>>>> size >>>>>>>> of >>>>>>>> the raw profile data extracted from the target, using hashed function >>>>>>>> names >>>>>>>> is clearly the wrong direction. >>>>>>>> >>>>>>>> *Without* imposing the requirement of passing the binary to >>>>>>>> llvm-profdata, I >>>>>>>> do like the ability to use hashed function names like you are >>>>>>>> proposing. >>>>>>>> It >>>>>>>> is a simple solution for reducing size overhead of function name >>>>>>>> strings >>>>>>>> with little complexity, as it is just swapping one string for >>>>>>>> another. >>>>>>> >>>>>>> Agree. The good news is that the overhead of hashed function names is >>>>>>> small enough that makes this approach attractive. >>>>>>> >>>>>>> thanks, >>>>>>> >>>>>>> David >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Obviously LLVM must be able to support the extremely large >>>>>>>>>> binaries >>>>>>>>>> in >>>>>>>>>> your >>>>>>>>>> configuration (otherwise what use is LLVM as a compiler ;) My >>>>>>>>>> questions >>>>>>>>>> are >>>>>>>>>> primarily aimed at establishing which tradeoffs are acceptable for >>>>>>>>>> supporting this (both for LLVM and for you guys). >>>>>>>>> >>>>>>>>> As I said, with the modified proposal (after getting your feedback), >>>>>>>>> no PGO users in LLVM land is going to lose anything/functionality. >>>>>>>>> The >>>>>>>>> end result will be net win for general users of LLVM (even though >>>>>>>>> your >>>>>>>>> customers don't care about it), not just 'us' as you have mentioned >>>>>>>>> many times. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Btw, for us, the issue of PGO data size is not completely >>>>>>>>>> immaterial >>>>>>>>>> but >>>>>>>>>> is >>>>>>>>>> very different from your use case. For us, the primary issue is >>>>>>>>>> the >>>>>>>>>> additional memory use at run time, since PS4 games usually use >>>>>>>>>> "all" >>>>>>>>>> available memory. We had a problem with UBSan where the large >>>>>>>>>> amount >>>>>>>>>> of >>>>>>>>>> memory required for storing the UBSan diagnostic data at runtime >>>>>>>>>> required >>>>>>>>>> the game programmers to manually change their memory map to make >>>>>>>>>> room. >>>>>>>>>> +Filipe, do you remember how much memory UBSan was using that >>>>>>>>>> caused >>>>>>>>>> a >>>>>>>>>> problem? >>>>>>>>>> >>>>>>>>> >>>>>>>>> My proposal does help reducing rodata size significantly. >>>>>>>> >>>>>>>> >>>>>>>> Yes, that is why I think that this is a useful thing to do. I just >>>>>>>> want >>>>>>>> to >>>>>>>> be careful about existing use cases and the relevant workflow issues. 
>>>>>>>> >>>>>>>> -- Sean Silva >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> David >>>>>>>>> >>>>>>>>> >>>>>>>>>> -- Sean Silva >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> In the raw format, md5 sum key can be an embedded field in the >>>>>>>>>>> prf_data variable instead of as different var referenced by >>>>>>>>>>> prf_data. >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> If this is not the case, you should show your current patch so >>>>>>>>>>>> that >>>>>>>>>>>> we >>>>>>>>>>>> can >>>>>>>>>>>> discuss things concretely. >>>>>>>>>>> >>>>>>>>>>> It is not. See above about the difference. >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> It will be >>>>>>>>>>>>>>> very messy to support multiple formats in instr-codegen and >>>>>>>>>>>>>>> instr-runtime. For compatibility concerns, the reader is >>>>>>>>>>>>>>> taught >>>>>>>>>>>>>>> to >>>>>>>>>>>>>>> support previous format, but the changes there are isolated >>>>>>>>>>>>>>> (also >>>>>>>>>>>>>>> expected to be removed in the future). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> My primary concern is that if the function name are not >>>>>>>>>>>>>>>> kept >>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>> stages, >>>>>>>>>>>>>>>> then it becomes difficult to analyze the profile data in >>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>> standalone >>>>>>>>>>>>>>>> way. >>>>>>>>>>>>>>>> Many times, I have used `llvm-profdata show >>>>>>>>>>>>>>>> -all-functions >>>>>>>>>>>>>>>> foo.profdata` >>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>> the resulting profile data and then imported that data >>>>>>>>>>>>>>>> into >>>>>>>>>>>>>>>> Mathematica >>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>> analysis. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is certainly a very valid use case. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> My understanding of your proposal is that `llvm-profdata >>>>>>>>>>>>>>>> show >>>>>>>>>>>>>>>> -all-functions foo.profdata` will not show the actual >>>>>>>>>>>>>>>> function >>>>>>>>>>>>>>>> names >>>>>>>>>>>>>>>> but >>>>>>>>>>>>>>>> instead MD5 hashes, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Yes. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> To support your use case, there are two solutions: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 1) user can add -fcoverage-mapping option in the build >>>>>>>>>>>>>>> 2) introduce a new option -fprofile-instr-names that force >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>> emission of the name sections in the .o file. This is >>>>>>>>>>>>>>> similar >>>>>>>>>>>>>>> to >>>>>>>>>>>>>>> 1), >>>>>>>>>>>>>>> but no covmap section is needed. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> llvm-profdata tool will be taught to read the name section >>>>>>>>>>>>>>> and >>>>>>>>>>>>>>> attach >>>>>>>>>>>>>>> function names to the profile records. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Needing to pass the executable to llvm-profdata would cause >>>>>>>>>>>>>> deployment >>>>>>>>>>>>>> issues for my customers in practice. >>>>>>>>>>>>> >>>>>>>>>>>>> Why? The deployment needs to pass the profile data anyway >>>>>>>>>>>>> right? >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Yes, but not the executable. >>>>>>>>>>>> >>>>>>>>>>>> The PGO training run is likely being run by a gameplay tester >>>>>>>>>>>> (non-programmer). In general the binary will not be lying >>>>>>>>>>>> around >>>>>>>>>>>> as a >>>>>>>>>>>> loose >>>>>>>>>>>> file anywhere, it will be part of a full package of the >>>>>>>>>>>> binary+assets >>>>>>>>>>>> (think >>>>>>>>>>>> like what will end up on a bluray disc). 
A game's binary >>>>>>>>>>>> *completely >>>>>>>>>>>> useless* without the assets, so except locally on a >>>>>>>>>>>> programmer's >>>>>>>>>>>> machine >>>>>>>>>>>> while they iterate/debug, there is no reason for a binary to >>>>>>>>>>>> ever >>>>>>>>>>>> exist >>>>>>>>>>>> as a >>>>>>>>>>>> standalone file. >>>>>>>>>>>> >>>>>>>>>>>> I'm not saying that needing the binary is insurmountable in any >>>>>>>>>>>> particular >>>>>>>>>>>> scenario. Just that it will cause a strict increase in the >>>>>>>>>>>> number >>>>>>>>>>>> of >>>>>>>>>>>> issues >>>>>>>>>>>> to deploying PGO. >>>>>>>>>>> >>>>>>>>>>> Your concern is acknowledged. >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> These are much bigger "compatibility concerns" for me than for >>>>>>>>>>>> newer >>>>>>>>>>>> toolchains to accept the old format. For a change in format I >>>>>>>>>>>> can >>>>>>>>>>>> easily >>>>>>>>>>>> tell my users to replace an exe with a newer one and that is >>>>>>>>>>>> all >>>>>>>>>>>> they >>>>>>>>>>>> need >>>>>>>>>>>> to do and it takes 10 seconds, guaranteed. A workflow change is >>>>>>>>>>>> potentially >>>>>>>>>>>> a massive disruption and guaranteed to take more than 10 >>>>>>>>>>>> seconds >>>>>>>>>>>> to >>>>>>>>>>>> fix >>>>>>>>>>>> (perhaps hours or days). >>>>>>>>>>> >>>>>>>>>>> ok. >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> This >>>>>>>>>>>>> is no different from llvm-cov usage model. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> In practice, getting the performance of PGO is a higher >>>>>>>>>>>> priority >>>>>>>>>>>> for >>>>>>>>>>>> my >>>>>>>>>>>> users, so we should not assume that llvm-cov is being used. >>>>>>>>>>> >>>>>>>>>>> Glad to hear that :) >>>>>>>>>>> >>>>>>>>>>> thanks, >>>>>>>>>>> >>>>>>>>>>> David >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- Sean Silva >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> David >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Note that with 1) or 2), the user can still benefit from >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>> reduced >>>>>>>>>>>>>>> profile size. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Let me reiterate that the size of the profile is not a >>>>>>>>>>>>>> problem >>>>>>>>>>>>>> I >>>>>>>>>>>>>> have >>>>>>>>>>>>>> observed in practice (nor have I heard of this being a >>>>>>>>>>>>>> problem >>>>>>>>>>>>>> in >>>>>>>>>>>>>> practice >>>>>>>>>>>>>> until this thread). Therefore I'm skeptical of any changes >>>>>>>>>>>>>> to >>>>>>>>>>>>>> our >>>>>>>>>>>>>> default >>>>>>>>>>>>>> behavior or any new requirements that are not opt-in. >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- Sean Silva >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> thanks, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> David >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> which will make it more difficult for me to do this kind >>>>>>>>>>>>>>>> of analysis (would require using nm on the original >>>>>>>>>>>>>>>> binary, >>>>>>>>>>>>>>>> hashing >>>>>>>>>>>>>>>> everything, etc.). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> btw, feel free to attach the patch even if it in a rough >>>>>>>>>>>>>>>> state. >>>>>>>>>>>>>>>> It >>>>>>>>>>>>>>>> can >>>>>>>>>>>>>>>> still >>>>>>>>>>>>>>>> help to clarify the proposal and be a good talking point. 
>>>>>>>>>>>>>>>> Fine-grained >>>>>>>>>>>>>>>> patch >>>>>>>>>>>>>>>> review for caring about the rough parts will happen on >>>>>>>>>>>>>>>> llvm-commits; >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> rough parts will not distract the discussion here on >>>>>>>>>>>>>>>> llvm-dev. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- Sean Silva >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> thanks, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> David >>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>> LLVM Developers mailing list >>>>>>>>>>>>>>>>> llvm-dev at lists.llvm.org >>>>>>>>>>>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>> >>>> > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
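A note for readers skimming the recap quoted above: the core of Solution-1 and Solution-3 is replacing each (possibly very long) mangled-name string with a fixed 8-byte key derived from the MD5 of the name, keeping only 64 bits of the digest. The sketch below illustrates that idea only. It uses OpenSSL's MD5 as a stand-in for LLVM's hashing, and the choice of which eight digest bytes to keep, the helper name functionNameKey, and the example mangled name are assumptions made for illustration, not the in-tree implementation.

// Sketch only: key profile records by a 64-bit hash of the mangled name
// instead of by the name string itself. OpenSSL's MD5 stands in for LLVM's
// hashing; which eight digest bytes form the key is an assumption here.
#include <openssl/md5.h>
#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Derive a fixed-length 64-bit key from an arbitrarily long mangled name.
uint64_t functionNameKey(const std::string &MangledName) {
  unsigned char Digest[MD5_DIGEST_LENGTH];
  MD5(reinterpret_cast<const unsigned char *>(MangledName.data()),
      MangledName.size(), Digest);
  uint64_t Key;
  std::memcpy(&Key, Digest, sizeof(Key)); // keep only 64 bits of the digest
  return Key;
}

int main() {
  // Hypothetical mangled name, used purely for illustration.
  const std::string Name = "_ZN7Example3fooEv";

  // Profile counters indexed by the hashed key: 8 bytes per function,
  // regardless of how long the mangled C++ name is.
  std::map<uint64_t, std::vector<uint64_t>> Counters;
  Counters[functionNameKey(Name)] = {42, 7};

  // A separate table (which can live in a strippable, non-allocatable
  // section) maps keys back to names when a tool wants symbolic output.
  std::map<uint64_t, std::string> NameTable;
  NameTable[functionNameKey(Name)] = Name;
  return 0;
}

The point of the scheme is visible in the two maps: the counters cost eight bytes of key per function no matter how long the C++ name is, while the table that restores symbolic output can be kept out of the process image and stripped from the binary.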
Xinliang David Li via llvm-dev
2015-Dec-14 17:31 UTC
[llvm-dev] RFC: Reducing Instr PGO size overhead
Sorry about the typo -- I meant indexed format version won't be changed.

thanks,

David

On Mon, Dec 14, 2015 at 8:55 AM, Bob Wilson <bob.wilson at apple.com> wrote:
>
>> On Dec 9, 2015, at 1:12 PM, Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> We are now very close to push the final stage of the PGO size
>> reduction task. Here is the updated plan going forward:
>>
>> 1) In this round, the format of the indexed profile data won't be unchanged.
>
> That’s a double negative. Was that intentional? I.E., are you changing the format?
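Point 3 of the plan above keeps the function names in a compressed name section (roughly 5:1 for typical C++ name strings) and only expands them when a tool needs symbolic output. The sketch below shows one way such a blob could be built and read back. It assumes the plain zlib C API, a '\1' byte as the separator between names, and a leading uncompressed-size field; all of these are illustrative assumptions, not the actual section layout used by the patches.

// Sketch only: build a compressed function-name blob and expand it on the
// reader side. Separator byte, size prefix, and use of raw zlib are
// assumptions for illustration; the real section layout may differ.
#include <zlib.h>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Writer side: join the mangled names into one string and deflate it.
std::vector<unsigned char> compressNames(const std::vector<std::string> &Names) {
  std::string Blob;
  for (const std::string &N : Names) {
    if (!Blob.empty())
      Blob.push_back('\1'); // assumed separator; cannot appear in a name
    Blob += N;
  }
  uint64_t RawSize = Blob.size();
  uLongf DestLen = compressBound(Blob.size());
  std::vector<unsigned char> Out(sizeof(RawSize) + DestLen);
  // Record the uncompressed size so the reader can size its buffer.
  std::memcpy(Out.data(), &RawSize, sizeof(RawSize));
  int RC = compress2(Out.data() + sizeof(RawSize), &DestLen,
                     reinterpret_cast<const Bytef *>(Blob.data()), Blob.size(),
                     Z_BEST_COMPRESSION);
  assert(RC == Z_OK && "zlib compression failed");
  (void)RC;
  Out.resize(sizeof(RawSize) + DestLen);
  return Out;
}

// Reader side: inflate the blob and split it back into individual names.
std::vector<std::string> decompressNames(const std::vector<unsigned char> &In) {
  uint64_t RawSize;
  std::memcpy(&RawSize, In.data(), sizeof(RawSize));
  std::string Blob(RawSize, '\0');
  uLongf DestLen = RawSize;
  int RC = uncompress(reinterpret_cast<Bytef *>(&Blob[0]), &DestLen,
                      In.data() + sizeof(RawSize), In.size() - sizeof(RawSize));
  assert(RC == Z_OK && DestLen == RawSize);
  (void)RC;
  std::vector<std::string> Names;
  size_t Pos = 0, Sep;
  while ((Sep = Blob.find('\1', Pos)) != std::string::npos) {
    Names.push_back(Blob.substr(Pos, Sep - Pos));
    Pos = Sep + 1;
  }
  Names.push_back(Blob.substr(Pos));
  return Names;
}

In this picture the 5:1 figure is simply the deflate ratio observed on typical mangled-name data; the exact number will vary by codebase.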
Bob Wilson via llvm-dev
2015-Dec-14 18:08 UTC
[llvm-dev] RFC: Reducing Instr PGO size overhead
Great — thanks for confirming that.

> On Dec 14, 2015, at 9:31 AM, Xinliang David Li <davidxl at google.com> wrote:
>
> Sorry about the typo -- I meant indexed format version won't be changed.
>
> thanks,
>
> David
I finally found time to implement a solution >>>>>>>> (Solution-3) that has the best size savings (for both PGO and coverage >>>>>>>> testing) with symbolic information available. Here is a re-cap of what >>>>>>>> we >>>>>>>> have discussed so far: >>>>>>>> >>>>>>>> Solution-1: >>>>>>>> >>>>>>>> This is the original solution proposed. In this solution, A function >>>>>>>> name's >>>>>>>> MD5 hash is used as the primary key (combined with IR hash) for function >>>>>>>> lookup. __llvm_prf_names section won't be emitted into the binary nor >>>>>>>> dumped >>>>>>>> into the profile data unless -fcoverage-mapping is also specified. >>>>>>>> >>>>>>>> Pros: >>>>>>>> 1. maximal reduction of instrumentation binary process image >>>>>>>> size >>>>>>>> 2. maximal reduction of object and unstripped binary size >>>>>>>> 3. maximal reduction of raw profile data size >>>>>>>> 4. maximal reduction of indexed profile data size >>>>>>>> >>>>>>>> Cons: >>>>>>>> 1. -fcoverage-mapping use case is not handled -- the size >>>>>>>> problem >>>>>>>> still exist >>>>>>>> 2. profile dump with llvm-profdata no longer have function names >>>>>>>> associated -- user needs to use postprocessing tool to get the >>>>>>>> functionality >>>>>>>> 3. function filtering with partial function name is not >>>>>>>> supported >>>>>>>> 4. Requires incompatible format change >>>>>>>> >>>>>>>> >>>>>>>> Solution-2: (http://reviews.llvm.org/D12715) >>>>>>>> >>>>>>>> In this solution, the MD5 hash string is used to replace the raw name >>>>>>>> string >>>>>>>> Pros: >>>>>>>> 1. Very simple to implement >>>>>>>> 2. have good reduction of all sizes for typical large C++ >>>>>>>> applications >>>>>>>> 3. No profile data format change is required. >>>>>>>> >>>>>>>> Cons: >>>>>>>> 1. Still requires 16byte overhead per-function -- can actually >>>>>>>> hurt C programs >>>>>>>> 2. -fcoverage-mapping use case is still not handled >>>>>>>> 3. The problem with llvm-profdata still exists (no symbolic >>>>>>>> info, >>>>>>>> partial filtering support) >>>>>>>> >>>>>>>> >>>>>>>> Solution-3: >>>>>>>> >>>>>>>> This is the new solution I am proposing. It is basically an enhancement >>>>>>>> of >>>>>>>> Solution-1 with most of the weakness resolved. The difference with >>>>>>>> Solution-1 is >>>>>>>> 1. Function name symbols are emitted into the symbol table as weak >>>>>>>> externs. They don't occupy any space at runtime and can be easily >>>>>>>> stripped. >>>>>>>> 2. -fcoverage-mapping does not need special handling -- it >>>>>>>> automatically benefit from the same size saving. >>>>>>>> 3. llvm-cov is changed to read symbol info from the symtab instead >>>>>>>> of >>>>>>>> reading them from the name section data >>>>>>>> 4. llvm-profdata is enhanced to take a binary as input and dump >>>>>>>> profile with names attached. Function filtering is fully supported >>>>>>>> (option >>>>>>>> can also be introduced to force dumping names into binary and profile >>>>>>>> data, >>>>>>>> so that llvm-profdata use case is not changed at all). >>>>>>>> >>>>>>>> Pros: >>>>>>>> 1. All the pros from Solution-1 >>>>>>>> 2. Size savings for coverage-mapping case >>>>>>>> Cons: >>>>>>>> Format change is required for profile data and coverage mapping. >>>>>>>> >>>>>>>> The initial patch is here: http://reviews.llvm.org/D13251 >>>>>>>> >>>>>>>> With this patch, the size of a release clang binary with coverage >>>>>>>> mapping is >>>>>>>> reduced from 986M to 569M. 
>>>>>>>> >>>>>>>> If there are no major concerns, I will carve out the patch into smaller >>>>>>>> ones for review. >>>>>>>> >>>>>>>> thanks, >>>>>>>> >>>>>>>> David >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Sep 8, 2015 at 3:47 PM, Xinliang David Li <davidxl at google.com> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> yes -- it is fixed length (8byte) blob which may include null >>>>>>>>>>>>> byte >>>>>>>>>>>>> in >>>>>>>>>>>>> the middle. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> For reference, MD5 sum is 16 bytes (128-bit): >>>>>>>>>>>> https://en.wikipedia.org/wiki/MD5 >>>>>>>>>>> >>>>>>>>>>> yes, LLVM's MD5 hash only takes the lower 64bit. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Or to say it another way, suppose that Itanium mangling >>>>>>>>>>>>>> required >>>>>>>>>>>>>> as a >>>>>>>>>>>>>> final >>>>>>>>>>>>>> step to replace the string with its md5 sum in hex. Therefore >>>>>>>>>>>>>> all >>>>>>>>>>>>>> symbol >>>>>>>>>>>>>> names are "small". My understanding is that this is effectively >>>>>>>>>>>>>> all >>>>>>>>>>>>>> your >>>>>>>>>>>>>> patch is doing. >>>>>>>>>>>>> >>>>>>>>>>>>> The key type before the change is StringRef, while the after the >>>>>>>>>>>>> key >>>>>>>>>>>>> type is uint64_t. Are you suggesting treating uint64_t md5 sum >>>>>>>>>>>>> key >>>>>>>>>>>>> as >>>>>>>>>>>>> a string of 8 bytes or storing md5 has in text form which will >>>>>>>>>>>>> double >>>>>>>>>>>>> the size? >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> How much does this change the benefit? If most of the benefit is >>>>>>>>>>>> avoiding >>>>>>>>>>>> extraordinarily long mangled names then it may be sufficient. >>>>>>>>>>>> >>>>>>>>>>>> With IR-level instrumentation like Rong is pursuing the size may >>>>>>>>>>>> be >>>>>>>>>>>> reduced >>>>>>>>>>>> sufficiently that we do not need the optimization proposed in this >>>>>>>>>>>> thread. >>>>>>>>>>>> For example, Rong found >2x size reduction on Google's C++ >>>>>>>>>>>> benchmarks, >>>>>>>>>>>> which >>>>>>>>>>>> I assume are representative of the extremely large Google binaries >>>>>>>>>>>> that >>>>>>>>>>>> are >>>>>>>>>>>> causing the problems addressed by your proposal in this thread. >>>>>>>>>>>> The >>>>>>>>>>>> measurements you mention for Clang in this thread provide similar >>>>>>>>>>>> size >>>>>>>>>>>> reductions, so Rong's approach may be sufficient (especially >>>>>>>>>>>> because >>>>>>>>>>>> functions with extremely large mangled names tend to be small >>>>>>>>>>>> inline >>>>>>>>>>>> functions in header-only template libraries). >>>>>>>>>>> >>>>>>>>>>> Late instrumentation helps many cases. In some cases (as shown in >>>>>>>>>>> SPEC), the reduction in size is not as large. Reducing PGO overhead >>>>>>>>>>> will lower the bar for its adoption. >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Of the points you mention in "Large size of overhead can limit the >>>>>>>>>>>> usability >>>>>>>>>>>> of PGO greatly", many of the issues are hard limits that prevent >>>>>>>>>>>> the >>>>>>>>>>>> use >>>>>>>>>>>> of >>>>>>>>>>>> PGO. Do you have a lower bound on how much the size of the PGO >>>>>>>>>>>> data >>>>>>>>>>>> must >>>>>>>>>>>> be >>>>>>>>>>>> reduced in order to overcome the hard limits? >>>>>>>>>>> >>>>>>>>>>> This is a static view: Think about the situation where application >>>>>>>>>>> size is ever increasing; also think about situation where we want to >>>>>>>>>>> >>>>>>>>>>> collect more types of profile data. 
Think about situation where user >>>>>>>>>>> want to run pgo binaries on small devices with tiny memory/storage >>>>>>>>>>> .. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> If we want to reduce memory overhead at runtime and reduce the size >>>>>>>>>> of >>>>>>>>>> the >>>>>>>>>> raw profile data extracted from the target, there are clear >>>>>>>>>> solutions. >>>>>>>>>> Consider that debug info does not need to be loaded into the memory >>>>>>>>>> image of >>>>>>>>>> the target; why should information identifying each counter need to >>>>>>>>>> be? >>>>>>>>>> A file containing raw profile counters is a subset of a core dump; in >>>>>>>>>> most >>>>>>>>>> environments, a core dump does not need to have debug info or symbol >>>>>>>>>> names >>>>>>>>>> in it, but can be still be read in full detail in conjunction with >>>>>>>>>> the >>>>>>>>>> original binary. >>>>>>>>> >>>>>>>>> Yes -- there are many alternatives: >>>>>>>>> 1) emit the name key mapping as a side data at compile time, or >>>>>>>>> 2) emit them into nonloadable sections of the object file. >>>>>>>>> >>>>>>>>> Compared with the above, LLVM's existing design does have its own >>>>>>>>> advantage -- making it easier for tool to access 'debug' info for >>>>>>>>> counters. >>>>>>>>> >>>>>>>>> LLVM's coverage testing, on the other hand, take a hybrid approach: It >>>>>>>>> emits the coverage map as rodata, but does not pass it to the profile >>>>>>>>> dumper. I think it is better to emit covmap as a side data not >>>>>>>>> attached to target binary. >>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thus, as we require that the binary be passed to llvm-profdata, there >>>>>>>>>> is >>>>>>>>>> no >>>>>>>>>> fundamental reason that the memory image of the program, or the raw >>>>>>>>>> data >>>>>>>>>> extracted from the program, must have any size overhead besides the >>>>>>>>>> raw >>>>>>>>>> values of the counters themselves and any text size increase for >>>>>>>>>> incrementing them. If we are willing to impose this requirement on >>>>>>>>>> users, >>>>>>>>>> then as far as reducing memory overhead at runtime and reducing the >>>>>>>>>> size >>>>>>>>>> of >>>>>>>>>> the raw profile data extracted from the target, using hashed function >>>>>>>>>> names >>>>>>>>>> is clearly the wrong direction. >>>>>>>>>> >>>>>>>>>> *Without* imposing the requirement of passing the binary to >>>>>>>>>> llvm-profdata, I >>>>>>>>>> do like the ability to use hashed function names like you are >>>>>>>>>> proposing. >>>>>>>>>> It >>>>>>>>>> is a simple solution for reducing size overhead of function name >>>>>>>>>> strings >>>>>>>>>> with little complexity, as it is just swapping one string for >>>>>>>>>> another. >>>>>>>>> >>>>>>>>> Agree. The good news is that the overhead of hashed function names is >>>>>>>>> small enough that makes this approach attractive. >>>>>>>>> >>>>>>>>> thanks, >>>>>>>>> >>>>>>>>> David >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Obviously LLVM must be able to support the extremely large >>>>>>>>>>>> binaries >>>>>>>>>>>> in >>>>>>>>>>>> your >>>>>>>>>>>> configuration (otherwise what use is LLVM as a compiler ;) My >>>>>>>>>>>> questions >>>>>>>>>>>> are >>>>>>>>>>>> primarily aimed at establishing which tradeoffs are acceptable for >>>>>>>>>>>> supporting this (both for LLVM and for you guys). >>>>>>>>>>> >>>>>>>>>>> As I said, with the modified proposal (after getting your feedback), >>>>>>>>>>> no PGO users in LLVM land is going to lose anything/functionality. 
>>>>>>>>>>> The >>>>>>>>>>> end result will be net win for general users of LLVM (even though >>>>>>>>>>> your >>>>>>>>>>> customers don't care about it), not just 'us' as you have mentioned >>>>>>>>>>> many times. >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Btw, for us, the issue of PGO data size is not completely >>>>>>>>>>>> immaterial >>>>>>>>>>>> but >>>>>>>>>>>> is >>>>>>>>>>>> very different from your use case. For us, the primary issue is >>>>>>>>>>>> the >>>>>>>>>>>> additional memory use at run time, since PS4 games usually use >>>>>>>>>>>> "all" >>>>>>>>>>>> available memory. We had a problem with UBSan where the large >>>>>>>>>>>> amount >>>>>>>>>>>> of >>>>>>>>>>>> memory required for storing the UBSan diagnostic data at runtime >>>>>>>>>>>> required >>>>>>>>>>>> the game programmers to manually change their memory map to make >>>>>>>>>>>> room. >>>>>>>>>>>> +Filipe, do you remember how much memory UBSan was using that >>>>>>>>>>>> caused >>>>>>>>>>>> a >>>>>>>>>>>> problem? >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> My proposal does help reducing rodata size significantly. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Yes, that is why I think that this is a useful thing to do. I just >>>>>>>>>> want >>>>>>>>>> to >>>>>>>>>> be careful about existing use cases and the relevant workflow issues. >>>>>>>>>> >>>>>>>>>> -- Sean Silva >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> David >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> -- Sean Silva >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> In the raw format, md5 sum key can be an embedded field in the >>>>>>>>>>>>> prf_data variable instead of as different var referenced by >>>>>>>>>>>>> prf_data. >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> If this is not the case, you should show your current patch so >>>>>>>>>>>>>> that >>>>>>>>>>>>>> we >>>>>>>>>>>>>> can >>>>>>>>>>>>>> discuss things concretely. >>>>>>>>>>>>> >>>>>>>>>>>>> It is not. See above about the difference. >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> It will be >>>>>>>>>>>>>>>>> very messy to support multiple formats in instr-codegen and >>>>>>>>>>>>>>>>> instr-runtime. For compatibility concerns, the reader is >>>>>>>>>>>>>>>>> taught >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> support previous format, but the changes there are isolated >>>>>>>>>>>>>>>>> (also >>>>>>>>>>>>>>>>> expected to be removed in the future). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> My primary concern is that if the function name are not >>>>>>>>>>>>>>>>>> kept >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>>>> stages, >>>>>>>>>>>>>>>>>> then it becomes difficult to analyze the profile data in >>>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>>> standalone >>>>>>>>>>>>>>>>>> way. >>>>>>>>>>>>>>>>>> Many times, I have used `llvm-profdata show >>>>>>>>>>>>>>>>>> -all-functions >>>>>>>>>>>>>>>>>> foo.profdata` >>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>> the resulting profile data and then imported that data >>>>>>>>>>>>>>>>>> into >>>>>>>>>>>>>>>>>> Mathematica >>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>> analysis. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> This is certainly a very valid use case. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> My understanding of your proposal is that `llvm-profdata >>>>>>>>>>>>>>>>>> show >>>>>>>>>>>>>>>>>> -all-functions foo.profdata` will not show the actual >>>>>>>>>>>>>>>>>> function >>>>>>>>>>>>>>>>>> names >>>>>>>>>>>>>>>>>> but >>>>>>>>>>>>>>>>>> instead MD5 hashes, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Yes. 
>>>>>>>>>>>>>>>>> It will be very messy to support multiple formats in instr-codegen and instr-runtime. For compatibility concerns, the reader is taught to support the previous format, but the changes there are isolated (and expected to be removed in the future).

>>>>>>>>>>>>>>>>>> My primary concern is that if the function names are not kept at all stages, then it becomes difficult to analyze the profile data in a standalone way. Many times, I have used `llvm-profdata show -all-functions foo.profdata` on the resulting profile data and then imported that data into Mathematica for analysis.

>>>>>>>>>>>>>>>>> This is certainly a very valid use case.

>>>>>>>>>>>>>>>>>> My understanding of your proposal is that `llvm-profdata show -all-functions foo.profdata` will not show the actual function names but instead MD5 hashes,

>>>>>>>>>>>>>>>>> Yes.

>>>>>>>>>>>>>>>>> To support your use case, there are two solutions:
>>>>>>>>>>>>>>>>> 1) the user can add the -fcoverage-mapping option to the build, or
>>>>>>>>>>>>>>>>> 2) introduce a new option, -fprofile-instr-names, that forces the emission of the name section in the .o file. This is similar to 1), but no covmap section is needed.

>>>>>>>>>>>>>>>>> The llvm-profdata tool will be taught to read the name section and attach function names to the profile records.
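As a rough sketch of that name-attachment step (not the actual llvm-profdata implementation): given whatever name strings the tool recovers, whether from a name section, the symbol table, or a side file, it can rebuild the key-to-name mapping and use it when printing records. The sketch assumes the llvm::MD5Hash helper from llvm/Support/MD5.h; the function names below are invented for illustration.

    #include "llvm/ADT/DenseMap.h"
    #include "llvm/ADT/StringRef.h"
    #include "llvm/Support/MD5.h"
    #include <cstdint>
    #include <string>
    #include <vector>

    // Map each recovered name to the same 64-bit key the instrumentation used.
    // The StringRefs must outlive the table (they point into the binary's data).
    llvm::DenseMap<uint64_t, llvm::StringRef>
    buildNameTable(const std::vector<llvm::StringRef> &RecoveredNames) {
      llvm::DenseMap<uint64_t, llvm::StringRef> Table;
      for (llvm::StringRef Name : RecoveredNames)
        Table[llvm::MD5Hash(Name)] = Name; // low 64 bits of the MD5 digest
      return Table;
    }

    // When dumping a record, print the symbolic name if one is available and
    // fall back to the raw key otherwise (e.g. when no binary was provided).
    std::string displayName(const llvm::DenseMap<uint64_t, llvm::StringRef> &Table,
                            uint64_t NameKey) {
      auto It = Table.find(NameKey);
      return It != Table.end() ? It->second.str()
                               : "md5:" + std::to_string(NameKey);
    }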
>>>>>>>>>>>>>>>> Needing to pass the executable to llvm-profdata would cause deployment issues for my customers in practice.

>>>>>>>>>>>>>>> Why? The deployment needs to pass the profile data anyway, right?

>>>>>>>>>>>>>> Yes, but not the executable.

>>>>>>>>>>>>>> The PGO training run is likely being run by a gameplay tester (non-programmer). In general the binary will not be lying around as a loose file anywhere; it will be part of a full package of binary+assets (think of what will end up on a Blu-ray disc). A game's binary is *completely useless* without the assets, so except locally on a programmer's machine while they iterate/debug, there is no reason for a binary to ever exist as a standalone file.

>>>>>>>>>>>>>> I'm not saying that needing the binary is insurmountable in any particular scenario, just that it will cause a strict increase in the number of issues in deploying PGO.

>>>>>>>>>>>>> Your concern is acknowledged.

>>>>>>>>>>>>>> These are much bigger "compatibility concerns" for me than having newer toolchains accept the old format. For a change in format I can easily tell my users to replace an exe with a newer one; that is all they need to do, and it takes 10 seconds, guaranteed. A workflow change is potentially a massive disruption and guaranteed to take more than 10 seconds to fix (perhaps hours or days).

>>>>>>>>>>>>> ok.

>>>>>>>>>>>>>>> This is no different from the llvm-cov usage model.

>>>>>>>>>>>>>> In practice, getting the performance of PGO is a higher priority for my users, so we should not assume that llvm-cov is being used.

>>>>>>>>>>>>> Glad to hear that :)

>>>>>>>>>>>>> thanks,

>>>>>>>>>>>>> David

>>>>>>>>>>>>>> -- Sean Silva

>>>>>>>>>>>>>>> David

>>>>>>>>>>>>>>>>> Note that with 1) or 2), the user can still benefit from the reduced profile size.

>>>>>>>>>>>>>>>> Let me reiterate that the size of the profile is not a problem I have observed in practice (nor have I heard of it being a problem in practice until this thread). Therefore I'm skeptical of any changes to our default behavior or any new requirements that are not opt-in.

>>>>>>>>>>>>>>>> -- Sean Silva

>>>>>>>>>>>>>>>>> thanks,

>>>>>>>>>>>>>>>>> David

>>>>>>>>>>>>>>>>>> which will make it more difficult for me to do this kind of analysis (it would require using nm on the original binary, hashing everything, etc.).

>>>>>>>>>>>>>>>>>> btw, feel free to attach the patch even if it is in a rough state. It can still help to clarify the proposal and be a good talking point. Fine-grained patch review that cares about the rough parts will happen on llvm-commits; the rough parts will not distract the discussion here on llvm-dev.

>>>>>>>>>>>>>>>>>> -- Sean Silva

>>>>>>>>>>>>>>>>>>> thanks,

>>>>>>>>>>>>>>>>>>> David

>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev