Thanks for testing out LLVM PGO and evaluating the performance.

We are currently still more focused on infrastructure improvements, which are the foundation for performance improvements. We are making good progress in this direction, but there are still some key missing pieces, such as profile data in the inliner. We are working on that. Once those are done, more focus will go toward making more passes profile aware and making the existing profile-aware passes better (e.g., code layout).

I looked at this particular example. GCC PGO can reduce the runtime by half, while LLVM's PGO makes no performance difference, as you noticed.

In the GCC case, PGO itself contributes about a 15% performance boost. The majority of the improvement comes from loop vectorization. Note that trunk GCC does not turn on vectorization at O2, but it does at O3 or at O2 with PGO.

LLVM also vectorizes the key loops. However, compared with GCC's vectorizer, LLVM's auto-vectorizer produces worse code (e.g., a long sequence of instructions to do sign extension): ~6.5 instr/iter vs ~9 instr/iter. GCC also does loop unrolling after vectorization, which helps a little more. LLVM's vectorization actually hurts performance a little.

We will look into this issue.

thanks,

David

On Fri, May 6, 2016 at 2:04 PM, Jie Chen <Jie.Chen at mathworks.com> wrote:

> Hi David,
>
> I am a performance engineer from MathWorks. I am currently exploring
> building our products with PGO on the Mac platform. While searching for
> llvm PGO solutions, I came across your name many times. So I thought you
> were probably the guy behind llvm's PGO implementation! :-) Here is what
> confused me regarding the llvm PGO capability. I started with a small
> program (see the code at the end of this email) for which I saw more than
> 10% performance improvement with PGO on Linux GCC (g++ -O2,
> -fprofile-generate, -fprofile-use). I wrote this code based on the
> assumption that llvm would rearrange the hot/cold branches based on the
> profile run. But when I tried with Apple Clang and Clang on Ubuntu, I did
> not see any performance improvement. Since I do not know the
> implementation details of llvm PGO, I am confused by not seeing the
> performance improvement I saw with GCC (and probably would see with
> Visual Studio PGO as well). Could you please offer me some insight into
> the issue? Or, as a further question, what kind of code would benefit
> from llvm PGO optimization?
> Best,
>
> Jie Chen
> MathWorks
>
>
> #include <iostream>
> #include <stdlib.h>
>
> using namespace std;
>
> long long hot() {
>     long long x = 0;
>
>     for (int i = 0; i < 1000; i++) {
>         x += i^2;
>     }
>
>     return x;
> }
>
> long long cold() {
>     long long y = 0;
>
>     for (int i = 0; i < 1000; i++) {
>         y += i^2;
>     }
>
>     return y;
> }
>
> long long foo() {
>     long long y = 0;
>
>     for (int i = 0; i < 1000; i++) {
>         y *= i^2;
>     }
>
>     return y*2;
> }
>
> long long bar() {
>     long long y = 0;
>
>     for (int i = 0; i < 1000; i++) {
>         y *= i^2;
>     }
>
>     return y*3;
> }
>
> #define SIZE 10000000
>
> int main() {
>
>     int* a = (int *)calloc(SIZE, sizeof(int));
>
>     a[100] = 1;
>
>     long long sum = 0;
>
>     for (int i = 0; i < SIZE; i++) {
>         if (a[i] == 1) {
>             sum += cold();
>         } else if (a[i] > 1) {
>             sum += bar();
>             sum += foo();
>         } else if (a[i] < 1) {
>             sum += hot();
>         }
>     }
>
>     cout << sum << endl;
>
>     return 0;
> }
>
>
> Makefile to compile the above code on Mac:
>
> .PHONY: clean
>
> regular: main.cpp
>         clang++ -O2 main.cpp -o main.regular
>
> hand: main2.cpp
>         clang++ -O2 main2.cpp -o main.regular2
>
> instr: main.cpp
>         clang++ -O2 -fprofile-instr-generate main.cpp -o main.instr
>
> profile: main.instr
>         ./main.instr
>
> merge: default.profraw
>         xcrun llvm-profdata merge -output default.profdata default.profraw
>
> optimize: default.profdata
>         clang++ -O2 -fprofile-instr-use=default.profdata main.cpp -o main.optimized
>
> clean:
>         $(RM) default.* main.instr main.optimized main.regular
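As a quick way to reproduce David's comparison of the two vectorizers above, both compilers can report their vectorization decisions through optimization remarks. The commands below are a sketch, not part of the original thread; the flags are standard clang and GCC options, the output wording varies by compiler version, and the file names follow the Makefile above:

# clang: report which loops were vectorized, which were not, and why
clang++ -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize main.cpp -o main.regular

# gcc: the same information when building with PGO, where vectorization is enabled at -O2
g++ -O2 -fprofile-generate main.cpp -o main.gcc.instr
./main.gcc.instr
g++ -O2 -fprofile-use -fopt-info-vec-optimized -fopt-info-vec-missed main.cpp -o main.gcc.pgo

# dump assembly from both compilers to compare the generated loop bodies
# (instructions per vectorized iteration)
clang++ -O2 -S main.cpp -o main.clang.s
g++ -O2 -fprofile-use -S main.cpp -o main.gcc.s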
Hi David,

Thanks for your great explanations, covering not only llvm but also gcc! To understand the code layout optimization better, I slightly changed my code: I now call the hot() function in the first if-branch instead of in the last else branch (see my modified code below). This essentially reduces the number of branch instructions executed and possibly improves branch predictor performance. On my Mac, I got about a 6% performance improvement (clang++ -O2) from this code change alone. Looking at the default.profraw data, I can see it contains the information the optimizer would need to make the same optimization as my manual one. I was hoping llvm PGO could do the same thing.

I am excited to hear from you that more infrastructure changes are under way that will improve PGO support. So, as of now, what is the list of PGO optimizations for which I could write some code and see an immediate improvement from llvm? It would be great to know such details. :-)

Best,

Jie


//main2.cpp: manual reordering of branches
#include <iostream>
#include <stdlib.h>

using namespace std;

long long hot() {
    long long x = 0;

    for (int i = 0; i < 1000; i++) {
        x += i^2;
    }

    return x;
}

long long cold() {
    long long y = 0;

    for (int i = 0; i < 1000; i++) {
        y += i^2;
    }

    return y;
}

long long foo() {
    long long y = 0;

    for (int i = 0; i < 1000; i++) {
        y *= i^2;
    }

    return y*2;
}

long long bar() {
    long long y = 0;

    for (int i = 0; i < 1000; i++) {
        y *= i^2;
    }

    return y*3;
}

#define SIZE 10000000

int main() {

    int* a = (int *)calloc(SIZE, sizeof(int));

    a[100] = 1;

    long long sum = 0;

    for (int i = 0; i < SIZE; i++) {
        if (a[i] < 1) {
            sum += hot();
        } else if (a[i] == 1) {
            sum += cold();
        } else if (a[i] > 1) {
            sum += bar();
            sum += foo();
        }
    }
    cout << sum << endl;
    return 0;
}
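For reference, the counters Jie refers to can be dumped directly once the raw profile is merged. This is a sketch using the same llvm-profdata tool as the Makefile in the first message; the exact output format depends on the LLVM version:

# merge the raw profile and dump per-function execution counts
xcrun llvm-profdata merge -output default.profdata default.profraw
xcrun llvm-profdata show -all-functions -counts default.profdata

# For this input (SIZE = 10000000, a[100] = 1), the per-branch counters in
# main() should show one arm executed once (the cold() call) and another
# executed ~9,999,999 times (the hot() call) -- exactly the information a
# profile-guided layout or branch-reordering pass would consume.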
On Sun, May 8, 2016 at 2:14 PM, Jie Chen <Jie.Chen at mathworks.com> wrote:

> Hi David,
>
> Thanks for your great explanations, covering not only llvm but also gcc!
> To understand the code layout optimization better, I slightly changed my
> code: I now call the hot() function in the first if-branch instead of in
> the last else branch (see my modified code below). This essentially
> reduces the number of branch instructions executed and possibly improves
> branch predictor performance. On my Mac, I got about a 6% performance
> improvement (clang++ -O2) from this code change alone. Looking at the
> default.profraw data, I can see it contains the information the optimizer
> would need to make the same optimization as my manual one. I was hoping
> llvm PGO could do the same thing.

yes -- this is a missing profile-guided control flow optimization -- reducing the hot path's control-dependence height by branch re-ordering -- possible when the branch conditions are mutually exclusive.

> I am excited to hear from you that more infrastructure changes are under
> way that will improve PGO support. So, as of now, what is the list of PGO
> optimizations for which I could write some code and see an immediate
> improvement from llvm? It would be great to know such details. :-)

What I can tell you is that there are many missing ones (that can benefit from profile), such as profile-aware LICM (patch pending), speculative PRE, loop unrolling, loop peeling, auto-vectorization, inlining, function splitting, function layout, function outlining, profile-driven size optimization, induction variable optimization/strength reduction, stringOp specialization/optimization/inlining, switch peeling/lowering, etc.

The biggest profile users today include regalloc, BB layout, ifcvt, shrink-wrapping, etc., but there should be room for improvement there too.

thanks,

David
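To make the branch re-ordering David describes concrete, here is a source-level sketch of the transformation. It is an illustration only, not LLVM's implementation; the function name process() and its trivial bodies are invented for the example and do not come from the thread:

// The three conditions below are mutually exclusive, so testing them in
// any order gives the same result. A profile-aware pass could therefore
// emit the test the profile shows to be hot first, shortening the hot
// path's chain of compare-and-branch instructions -- the transformation
// that main2.cpp performs by hand.
long long process(const int* a, long long n) {
    long long sum = 0;
    for (long long i = 0; i < n; i++) {
        // Source order in main.cpp:       == 1, > 1, < 1
        // Profile-preferred order here:    < 1, == 1, > 1
        if (a[i] < 1) {            // hot: taken ~n-1 times for this input
            sum += 3;
        } else if (a[i] == 1) {    // cold: taken once
            sum += 1;
        } else {                   // a[i] > 1: never taken for this input
            sum += 2;
        }
    }
    return sum;
}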