David Callahan via llvm-dev
2016-Aug-12 17:06 UTC
[llvm-dev] AutoFDO sample profiles v. SelectInst,
I am looking for advice on a problem observed with -fprofile-sample-use for
samples built with the AutoFDO tool.

I took the "hmmer" benchmark out of SPEC2006. It is initially compiled:

    clang++ -o hmmer -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG \
        -fno-strict-aliasing -w -g *.c

This baseline binary runs in about 164.2 seconds as reported by "perf stat".

We build a sample file from this program using the AutoFDO tool
"create_llvm_prof":

    perf report -b hmmer nph3.hmm swiss41wa
    create_llvm_prof -out hmmer.llvm ...

and rebuild the binary using this profile:

    clang++ -o hmmer-fdo -fprofile-sample-use=hmmer.llvm \
        -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG -fno-strict-aliasing -w -g *.c

Now, sadly, this program runs in 231.2 seconds.

The problem is that when a short conditional block is converted to a
SelectInst, we are unable to accurately recover the branch frequencies since
there is no actual branching. When we then compile in the presence of the
sample profile, the "CodeGen Prepare" phase examines the profile data and
undoes the select conversion, with disastrous results.

If we compile -O0 for training, and then use the profile, now with accurate
branch weights, the program runs in 149.5 seconds. Unfortunately, of course,
the training program runs in 501.4 seconds.

Alternately, if we disable the original select conversion performed in
SpeculativelyExecuteBB in SimplifyCFG.cpp so the original control flow is
visible to sampling, the training program runs in 229.7 seconds and the
optimized program runs in 151.5, so we recover essentially all of the lost
information.

Of course, both of these options are unfortunate because they alter the
workflow; it would be preferable to be able to monitor the production code
and feed that back into production builds. That suggests that we remove the
use of profile data in the CodeGen Prepare phase. When that change is made,
and we sample the baseline -O3 binary, the resulting optimized binary runs
in 158.9 seconds.

That result is at least slightly better than baseline instead of much worse,
but we are leaving 2-3% on the table. Maybe that is a reasonable trade-off
for having only production builds.

Any advice or suggestions?

Thanks
david
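For concreteness, the shape of code involved looks like the made-up example
below (it is an illustration only, not taken from hmmer): SimplifyCFG's
SpeculativelyExecuteBB speculates the short conditional block and collapses
the if into a single SelectInst, so the -O3 binary has no branch here for
perf to sample.

    /* Made-up example, not from hmmer.  At -O3 the conditional assignment
       is speculated and the if/else becomes one SelectInst (a cmov on x86),
       so all samples land on a single location and the taken/not-taken
       ratio of the original branch cannot be recovered from the profile. */
    int saturate(int x, int limit) {
        int y = x;
        if (x > limit)
            y = limit;   /* short conditional block: one cheap assignment */
        return y;
    }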
Xinliang David Li via llvm-dev
2016-Aug-12 18:15 UTC
[llvm-dev] AutoFDO sample profiles v. SelectInst,
+dehao.

There are two potential problems:

1) The branch gets eliminated in the binary that is being profiled, so there
   is no profile data.
2) The select instruction is lowered into a branch, but the branch profile
   data is not annotated back to the select instruction.

2) is something that can be improved in SampleFDO.

On Fri, Aug 12, 2016 at 10:06 AM, David Callahan via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> I am looking for advice on a problem observed with -fprofile-sample-use
> for samples built with the AutoFDO tool.
>
> I took the "hmmer" benchmark out of SPEC2006. It is initially compiled:
>
>     clang++ -o hmmer -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG \
>         -fno-strict-aliasing -w -g *.c
>
> This baseline binary runs in about 164.2 seconds as reported by "perf stat".
>
> We build a sample file from this program using the AutoFDO tool
> "create_llvm_prof":
>
>     perf report -b hmmer nph3.hmm swiss41wa

perf record ?

>     create_llvm_prof -out hmmer.llvm ...
>
> and rebuild the binary using this profile:
>
>     clang++ -o hmmer-fdo -fprofile-sample-use=hmmer.llvm \
>         -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG -fno-strict-aliasing -w -g *.c
>
> Now, sadly, this program runs in 231.2 seconds.
>
> The problem is that when a short conditional block is converted to a
> SelectInst, we are unable to accurately recover the branch frequencies
> since there is no actual branching. When we then compile in the presence
> of the sample profile, the "CodeGen Prepare" phase examines the profile
> data and undoes the select conversion, with disastrous results.

This looks like a bug here -- is it likely that the SelectInst somehow gets
annotated with bad profile data? Should it make the same decision as if
AutoFDO is not used? A smaller reproducer would be helpful here.

> If we compile -O0 for training, and then use the profile, now with
> accurate branch weights, the program runs in 149.5 seconds.
> Unfortunately, of course, the training program runs in 501.4 seconds.
>
> Alternately, if we disable the original select conversion performed in
> SpeculativelyExecuteBB in SimplifyCFG.cpp so the original control flow is
> visible to sampling, the training program runs in 229.7 seconds and the
> optimized program runs in 151.5, so we recover essentially all of the
> lost information.
>
> Of course, both of these options are unfortunate because they alter the
> workflow; it would be preferable to be able to monitor the production
> code and feed that back into production builds. That suggests that we
> remove the use of profile data in the CodeGen Prepare phase. When that
> change is made, and we sample the baseline -O3 binary, the resulting
> optimized binary runs in 158.9 seconds.
>
> That result is at least slightly better than baseline instead of much
> worse, but we are leaving 2-3% on the table. Maybe that is a reasonable
> trade-off for having only production builds.
>
> Any advice or suggestions?

Please file a bug with something to reproduce: preprocessed file, compiler
command line, and profile data in text form.

David

> Thanks
> david
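For reference, the text form of a sample profile (the format consumed by
-fprofile-sample-use) looks roughly like the sketch below; the function
names, line offsets, and counts here are made up for illustration. Each
function header is name:total_samples:total_head_samples, each body line is
line_offset[.discriminator]: samples [callee:count ...], and an indented
block records samples for an inlined callee.

    main:10000:1200
     1: 1000
     4: 8000
     4.1: 500 helper:300
     9: inlined_callee:2000
      1: 2000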
David Callahan via llvm-dev
2016-Aug-15 23:54 UTC
[llvm-dev] AutoFDO sample profiles v. SelectInst,
I filed two bugs:

    https://llvm.org/bugs/show_bug.cgi?id=28990
    https://llvm.org/bugs/show_bug.cgi?id=28991

They appear different but may be related.

Thanks
david
Sanjay Patel via llvm-dev
2016-Aug-17 15:19 UTC
[llvm-dev] AutoFDO sample profiles v. SelectInst,
On Fri, Aug 12, 2016 at 12:15 PM, Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> +dehao.
>
> There are two potential problems:
>
> 1) the branch gets eliminated in the binary that is being profiled, so
> there is no profile data

This seems like a fundamental problem for PGO. Maybe it is also responsible
for this bug: https://llvm.org/bugs/show_bug.cgi?id=27359 ?

Should we limit select optimizations in IR for a PGO-training build? Or
should there be a 'select smasher' pass later in the pipeline that turns
selects into branches for a PGO-training build?

(I don't have a good understanding of PGO, so I'm just throwing out
ideas... maybe a better question is: how do other compilers handle this?)
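To make the idea concrete, the core rewrite such a pass would perform is
sketched below. This is an illustration only, under assumptions: it is not
an existing pass, the helper name smashSelect is invented, and pass
boilerplate, legality checks, and profitability heuristics are omitted. It
expands a select back into an explicit if/then/else diamond so a
PGO-training binary has a real branch for perf to sample; the expansion
itself is essentially what CodeGen Prepare already does when it decides a
select should be a branch.

    // Illustrative sketch only -- not an existing LLVM pass.
    #include "llvm/IR/Instructions.h"
    #include "llvm/Transforms/Utils/BasicBlockUtils.h"
    using namespace llvm;

    // Expand "%r = select i1 %c, T %a, T %b" into an if/then/else diamond
    // joined by a PHI, so the condition becomes an observable branch.
    static void smashSelect(SelectInst *SI) {
      TerminatorInst *ThenTerm = nullptr;
      TerminatorInst *ElseTerm = nullptr;
      // Split the block at the select and insert a conditional branch on
      // the select's condition, with (empty) then and else blocks that
      // both fall through to the block containing the select.
      SplitBlockAndInsertIfThenElse(SI->getCondition(), SI, &ThenTerm,
                                    &ElseTerm);
      // Rejoin the two arms with a PHI that takes over the select's uses.
      PHINode *PN = PHINode::Create(SI->getType(), /*NumReservedValues=*/2,
                                    "", SI);
      PN->addIncoming(SI->getTrueValue(), ThenTerm->getParent());
      PN->addIncoming(SI->getFalseValue(), ElseTerm->getParent());
      PN->takeName(SI);
      SI->replaceAllUsesWith(PN);
      SI->eraseFromParent();
    }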