Dehao Chen via llvm-dev
2017-Jan-31 23:20 UTC
[llvm-dev] (RFC) Adjusting default loop fully unroll threshold
Recollected the data from trunk head, with stddev data and more threshold data points attached:

Performance:

Benchmark  stddev/mean     300      450      600      750
403           0.37%       0.11%    0.11%    0.09%    0.79%
433           0.14%       0.51%    0.25%   -0.63%   -0.29%
445           0.08%       0.48%    0.89%    0.12%    0.83%
447           0.16%       3.50%    2.69%    3.66%    3.59%
453           0.11%       1.49%    0.45%   -0.07%    0.78%
464           0.17%       0.75%    1.80%    1.86%    1.54%

Code size:

Benchmark      300      450      600      750
403           0.56%    2.41%    2.74%    3.75%
433           0.96%    2.84%    4.19%    4.87%
445           2.16%    3.62%    4.48%    5.88%
447           2.96%    5.09%    6.74%    8.89%
453           0.94%    1.67%    2.73%    2.96%
464           8.02%   13.50%   20.51%   26.59%

Compile time is proportional in the experiments and noisier, so I did not include it.

We see a >2% speedup on some Google-internal benchmarks when switching the threshold from 150 to 300.

Dehao

On Mon, Jan 30, 2017 at 5:06 PM, Chandler Carruth <chandlerc at google.com> wrote:
> On Mon, Jan 30, 2017 at 4:59 PM Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>> Another question is about PGO integration: is it already hooked there?
>> Should we have a more aggressive threshold in a hot function? (Assuming
>> we’re willing to spend some binary size there but not on the cold path).
>>
>> I would even wire the *unrolling* the other way: just suppress unrolling
>> in cold paths to save binary size. Rolled loops seem like a generally good
>> thing in cold code unless they are having some larger impact (i.e., the loop
>> itself is more expensive than the unrolled form).
>>
>> Agree that we could suppress unrolling in cold paths to save code size.
>> But that's orthogonal to the proposal here. This proposal focuses on O2
>> performance: should we have a different (higher) full unroll threshold
>> than for dynamic/partial unrolling?
>>
>> I agree that this is (to some extent) orthogonal, and it makes sense to
>> me to differentiate the threshold for full unroll and the dynamic/partial
>> case.
>
> There is one issue that makes these not orthogonal.
>
> If even *static* profile hints will reduce some of the code size increase
> caused by higher unrolling thresholds for non-cold code, we should factor
> that into the tradeoff in picking where the threshold goes.
>
> However, getting PGO into the full unroller is currently challenging
> outside of the new pass manager. We already have some unfortunate hacks
> around this in LoopUnswitch that are making the port of it to the new PM
> more annoying.
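
The policy debated in the quoted discussion above (a higher threshold for full unrolling than for dynamic/partial unrolling, with unrolling suppressed in cold code) can be illustrated with a short sketch. This is a hypothetical model, not LLVM's implementation: the names `pick_unroll_threshold`, `should_unroll`, `PARTIAL_THRESHOLD` and `FULL_THRESHOLD` are invented here, the values 150 and 300 are the two numbers mentioned in this message, and keeping the partial threshold at 150 is an assumption.

```python
# Hypothetical sketch of the threshold policy discussed in the thread.
# None of these names exist in LLVM; they are for illustration only.

PARTIAL_THRESHOLD = 150  # assumed: dynamic/partial unrolling keeps the old value
FULL_THRESHOLD = 300     # the higher full-unroll value proposed in the thread


def pick_unroll_threshold(full_unroll: bool, is_cold: bool) -> int:
    """Return the size budget for unrolling a loop under this sketch."""
    if is_cold:
        # Suppress unrolling in cold code to save binary size.
        return 0
    return FULL_THRESHOLD if full_unroll else PARTIAL_THRESHOLD


def should_unroll(unrolled_size: int, full_unroll: bool, is_cold: bool) -> bool:
    """Unroll only if the estimated unrolled size fits within the budget."""
    return unrolled_size <= pick_unroll_threshold(full_unroll, is_cold)
```

Under these assumptions, a loop whose fully unrolled size is estimated at 200 would be fully unrolled in non-cold code but would not qualify for partial unrolling, and no loop of positive size would be unrolled in cold code.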
Dehao Chen via llvm-dev
2017-Feb-02 00:33 UTC
[llvm-dev] (RFC) Adjusting default loop fully unroll threshold
With the new data points, any comments on whether this can justify setting the full unroll threshold to 300 (or any other number) at O2? I can collect more data points if that's helpful.

Thanks,
Dehao
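
One rough way to weigh the question is to average each column of the tables posted earlier in the thread. The sketch below does only that: the numbers are copied from those tables, the helper names `column_means` and `summarize` are invented here, and reading positive performance values as speedups is an assumption. A plain arithmetic mean over six benchmarks is a coarse summary, not a substitute for the per-benchmark data.

```python
from statistics import mean

THRESHOLDS = (300, 450, 600, 750)

# Benchmark id -> per-threshold percentages, copied from the thread's tables.
# Assumption: positive performance numbers are speedups.
SPEEDUP = {
    403: (0.11, 0.11, 0.09, 0.79),
    433: (0.51, 0.25, -0.63, -0.29),
    445: (0.48, 0.89, 0.12, 0.83),
    447: (3.50, 2.69, 3.66, 3.59),
    453: (1.49, 0.45, -0.07, 0.78),
    464: (0.75, 1.80, 1.86, 1.54),
}
SIZE_GROWTH = {
    403: (0.56, 2.41, 2.74, 3.75),
    433: (0.96, 2.84, 4.19, 4.87),
    445: (2.16, 3.62, 4.48, 5.88),
    447: (2.96, 5.09, 6.74, 8.89),
    453: (0.94, 1.67, 2.73, 2.96),
    464: (8.02, 13.50, 20.51, 26.59),
}


def column_means(table):
    """Arithmetic mean of each threshold column across all benchmarks."""
    return [mean(column) for column in zip(*table.values())]


def summarize():
    """Return (threshold, mean speedup %, mean code-size growth %) tuples."""
    return list(zip(THRESHOLDS, column_means(SPEEDUP), column_means(SIZE_GROWTH)))


for threshold, speedup, size in summarize():
    print(f"{threshold}: mean speedup {speedup:.2f}%, mean size growth {size:.2f}%")
```

On these six benchmarks the mean size growth rises steadily with the threshold, while the mean speedup does not; 464 dominates the size column and 447 the performance column, so the averages should be read alongside the individual rows.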
Chandler Carruth via llvm-dev
2017-Feb-02 00:47 UTC
[llvm-dev] (RFC) Adjusting default loop fully unroll threshold
I had suggested collecting size metrics from somewhat larger applications, such as Chrome, WebKit, or Firefox; clang itself; and maybe some of our internal binaries, with rough size brackets?