David Callahan via llvm-dev
2016-Aug-12 17:06 UTC
[llvm-dev] AutoFDO sample profiles v. SelectInst,
I am looking for advice on a problem observed with -fprofile-sample-use for
samples built with the AutoFDO tool.

I took the "hmmer" benchmark out of SPEC2006. It is initially compiled:

    clang++ -o hmmer -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG \
        -fno-strict-aliasing -w -g *.c

This baseline binary runs in about 164.2 seconds as reported by "perf stat".

We build a sample file from this program using the AutoFDO tool
"create_llvm_prof":

    perf report -b hmmer nph3.hmm swiss41wa
    create_llvm_prof -out hmmer.llvm ...

and rebuild the binary using this profile:

    clang++ -o hmmer-fdo -fprofile-sample-use=hmmer.llvm \
        -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG -fno-strict-aliasing -w -g *.c

Now, sadly, this program runs in 231.2 seconds.

The problem is that when a short conditional block is converted to a
SelectInst, we are unable to accurately recover the branch frequencies since
there is no actual branching. When we then compile in the presence of the
sample profile, the "CodeGen Prepare" phase examines the profile data and
undoes the select conversion, with disastrous results.

If we compile -O0 for training, and then use the profile, now with accurate
branch weights, the program runs in 149.5 seconds. Unfortunately, of course,
the training program runs in 501.4 seconds.

Alternately, if we disable the original select conversion performed in
SpeculativelyExecuteBB in SimplifyCFG.cpp so the original control flow is
visible to sampling, the training program runs in 229.7 seconds and the
optimized program runs in 151.5, so we recover essentially all of the lost
information.

Of course, both of these options are unfortunate because they alter the
workflow; it would be preferable to be able to monitor the production code
and feed that back into production builds. That suggests that we remove the
use of profile data in the CodeGen Prepare phase. When that change is made,
and we sample the baseline -O3 binary, the resulting optimized binary runs
in 158.9 seconds.

That result is at least slightly better than baseline instead of much worse,
but we are leaving 2-3% on the table. Maybe that is a reasonable trade-off
for having only production builds.

Any advice or suggestions?

Thanks
david
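For concreteness, the shape of code involved looks like the made-up example
below (it is an illustration only, not taken from hmmer): SimplifyCFG's
SpeculativelyExecuteBB speculates the short conditional block and collapses
the if into a single SelectInst, so the -O3 binary has no branch here for
perf to sample.

    /* Made-up example, not from hmmer.  At -O3 the conditional assignment
       is speculated and the if/else becomes one SelectInst (a cmov on x86),
       so all samples land on a single location and the taken/not-taken
       ratio of the original branch cannot be recovered from the profile. */
    int saturate(int x, int limit) {
        int y = x;
        if (x > limit)
            y = limit;   /* short conditional block: one cheap assignment */
        return y;
    }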
Xinliang David Li via llvm-dev
2016-Aug-12 18:15 UTC
[llvm-dev] AutoFDO sample profiles v. SelectInst,
+dehao.

There are two potential problems:

1) The branch gets eliminated in the binary that is being profiled, so there
   is no profile data.
2) The select instruction is lowered into a branch, but the branch profile
   data is not annotated back to the select instruction.

2) is something that can be improved in SampleFDO.

On Fri, Aug 12, 2016 at 10:06 AM, David Callahan via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> I am looking for advice on a problem observed with -fprofile-sample-use
> for samples built with the AutoFDO tool.
>
> I took the "hmmer" benchmark out of SPEC2006. It is initially compiled:
>
>     clang++ -o hmmer -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG \
>         -fno-strict-aliasing -w -g *.c
>
> This baseline binary runs in about 164.2 seconds as reported by "perf stat".
>
> We build a sample file from this program using the AutoFDO tool
> "create_llvm_prof":
>
>     perf report -b hmmer nph3.hmm swiss41wa

perf record ?

>     create_llvm_prof -out hmmer.llvm ...
>
> and rebuild the binary using this profile:
>
>     clang++ -o hmmer-fdo -fprofile-sample-use=hmmer.llvm \
>         -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG -fno-strict-aliasing -w -g *.c
>
> Now, sadly, this program runs in 231.2 seconds.
>
> The problem is that when a short conditional block is converted to a
> SelectInst, we are unable to accurately recover the branch frequencies
> since there is no actual branching. When we then compile in the presence
> of the sample profile, the "CodeGen Prepare" phase examines the profile
> data and undoes the select conversion, with disastrous results.

This looks like a bug here -- is it likely that the SelectInst somehow gets
annotated with bad profile data? Should it make the same decision as if
AutoFDO is not used? A smaller reproducer would be helpful here.

> If we compile -O0 for training, and then use the profile, now with
> accurate branch weights, the program runs in 149.5 seconds.
> Unfortunately, of course, the training program runs in 501.4 seconds.
>
> Alternately, if we disable the original select conversion performed in
> SpeculativelyExecuteBB in SimplifyCFG.cpp so the original control flow is
> visible to sampling, the training program runs in 229.7 seconds and the
> optimized program runs in 151.5, so we recover essentially all of the
> lost information.
>
> Of course, both of these options are unfortunate because they alter the
> workflow; it would be preferable to be able to monitor the production
> code and feed that back into production builds. That suggests that we
> remove the use of profile data in the CodeGen Prepare phase. When that
> change is made, and we sample the baseline -O3 binary, the resulting
> optimized binary runs in 158.9 seconds.
>
> That result is at least slightly better than baseline instead of much
> worse, but we are leaving 2-3% on the table. Maybe that is a reasonable
> trade-off for having only production builds.
>
> Any advice or suggestions?

Please file a bug with something to reproduce: preprocessed file, compiler
command line, and profile data in text form.

David

> Thanks
> david
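For reference, the text form of a sample profile (the format consumed by
-fprofile-sample-use) looks roughly like the sketch below; the function
names, line offsets, and counts here are made up for illustration. Each
function header is name:total_samples:total_head_samples, each body line is
line_offset[.discriminator]: samples [callee:count ...], and an indented
block records samples for an inlined callee.

    main:10000:1200
     1: 1000
     4: 8000
     4.1: 500 helper:300
     9: inlined_callee:2000
      1: 2000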
David Callahan via llvm-dev
2016-Aug-15 23:54 UTC
[llvm-dev] AutoFDO sample profiles v. SelectInst,
I filed two bugs:

    https://llvm.org/bugs/show_bug.cgi?id=28990
    https://llvm.org/bugs/show_bug.cgi?id=28991

They appear different but may be related.

Thanks
david
Sanjay Patel via llvm-dev
2016-Aug-17 15:19 UTC
[llvm-dev] AutoFDO sample profiles v. SelectInst,
On Fri, Aug 12, 2016 at 12:15 PM, Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> +dehao.
>
> There are two potential problems:
>
> 1) the branch gets eliminated in the binary that is being profiled, so
> there is no profile data

This seems like a fundamental problem for PGO. Maybe it is also responsible
for this bug: https://llvm.org/bugs/show_bug.cgi?id=27359 ?

Should we limit select optimizations in IR for a PGO-training build? Or
should there be a 'select smasher' pass later in the pipeline that turns
selects into branches for a PGO-training build?

(I don't have a good understanding of PGO, so I'm just throwing out
ideas... maybe a better question is: how do other compilers handle this?)
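To make the idea concrete, the core rewrite such a pass would perform is
sketched below. This is an illustration only, under assumptions: it is not
an existing pass, the helper name smashSelect is invented, and pass
boilerplate, legality checks, and profitability heuristics are omitted. It
expands a select back into an explicit if/then/else diamond so a
PGO-training binary has a real branch for perf to sample; the expansion
itself is essentially what CodeGen Prepare already does when it decides a
select should be a branch.

    // Illustrative sketch only -- not an existing LLVM pass.
    #include "llvm/IR/Instructions.h"
    #include "llvm/Transforms/Utils/BasicBlockUtils.h"
    using namespace llvm;

    // Expand "%r = select i1 %c, T %a, T %b" into an if/then/else diamond
    // joined by a PHI, so the condition becomes an observable branch.
    static void smashSelect(SelectInst *SI) {
      TerminatorInst *ThenTerm = nullptr;
      TerminatorInst *ElseTerm = nullptr;
      // Split the block at the select and insert a conditional branch on
      // the select's condition, with (empty) then and else blocks that
      // both fall through to the block containing the select.
      SplitBlockAndInsertIfThenElse(SI->getCondition(), SI, &ThenTerm,
                                    &ElseTerm);
      // Rejoin the two arms with a PHI that takes over the select's uses.
      PHINode *PN = PHINode::Create(SI->getType(), /*NumReservedValues=*/2,
                                    "", SI);
      PN->addIncoming(SI->getTrueValue(), ThenTerm->getParent());
      PN->addIncoming(SI->getFalseValue(), ElseTerm->getParent());
      PN->takeName(SI);
      SI->replaceAllUsesWith(PN);
      SI->eraseFromParent();
    }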