thr3ads.net - llvm dev - [llvm-dev] clang performing worse than gcc for this loop [Aug 2020]

Riyaz Puthiyapurayil via llvm-dev

2020-Aug-23 17:53 UTC

[llvm-dev] clang performing worse than gcc for this loop

I am analyzing a clang 10.0.0 vs gcc 7.3 performance difference that I can
reproduce in the following test.

unsigned foo(unsigned t1, unsigned t2, int count, int step) {
    unsigned tmp = 0;
    int state = 0;
    for (int i = 0; i < count; i += step) {
        state++;
        if (state > 5)
            state = 0;
        if (state == 3)
            tmp += t2;
    }
    return tmp;
}

The clang output is about 40% slower when the function is called with t2=7,
count=2000000000, step=3 (t1 is unimportant here, as it is unused). The
attached screenshot shows the `perf report` annotated assembly code from
clang and gcc (clang is on the left). The gcc-generated code takes 0.512 sec
vs. 0.731 sec for clang's. The machine I am running on is a Broadwell:
Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz.
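
For anyone who wants to reproduce the measurement, here is a minimal timing
sketch. It is not the harness used in the post (none is shown); `time_foo` is a
hypothetical helper, `clock()` measures CPU time rather than wall time, and the
optimization flags are not stated in the post, so pick the same level for both
compilers.

```c
#include <assert.h>
#include <time.h>

/* The loop from the post, unchanged. */
unsigned foo(unsigned t1, unsigned t2, int count, int step) {
    unsigned tmp = 0;
    int state = 0;
    (void)t1; /* unused, as noted in the post */
    for (int i = 0; i < count; i += step) {
        state++;
        if (state > 5)
            state = 0;
        if (state == 3)
            tmp += t2;
    }
    return tmp;
}

/* Hypothetical helper (not from the post): runs foo once, stores its
 * result in *result, and returns the elapsed CPU time in seconds. */
double time_foo(unsigned t2, int count, int step, unsigned *result) {
    assert(result != NULL);
    clock_t start = clock();
    *result = foo(0, t2, count, step);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_foo(7, 2000000000, 3, &r)` corresponds to the parameters above.
The loop then runs 666,666,667 iterations and `state == 3` holds on every
sixth one, so `r` should come out as 777777777; checking that value guards
against timing a loop the compiler has folded away or miscompiled. If the
caller passes compile-time constants, keep `foo` in a separate translation
unit so the call is not constant-folded.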


The code generated by gcc runs consistently faster for all values of `step` I
tried; in some cases, the performance difference is worse than the 40% seen
with the aforementioned parameter values to `foo`. The code generated by clang
is a direct result of simplifycfg, which eliminates the inner branches and
replaces them with `select`s, which are then lowered to the two `cmov`
instructions.
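
To make the two shapes concrete, here is a source-level sketch of the same loop
written once with branches and once with selects. The names `foo_branchy`,
`foo_select` and `check_forms_agree` are made up for illustration, and the
unused `t1` is dropped. This only mirrors the transformation at the C level;
the compiler is free to emit branches or `cmov`s for either form, so inspect
the assembly rather than assuming.

```c
#include <assert.h>

/* Branchy form: the loop body as written in the post. */
unsigned foo_branchy(unsigned t2, int count, int step) {
    unsigned tmp = 0;
    int state = 0;
    for (int i = 0; i < count; i += step) {
        state++;
        if (state > 5)
            state = 0;
        if (state == 3)
            tmp += t2;
    }
    return tmp;
}

/* Select form: each inner `if` is rewritten as a conditional expression,
 * roughly the shape the loop has once the branches become selects. */
unsigned foo_select(unsigned t2, int count, int step) {
    unsigned tmp = 0;
    int state = 0;
    for (int i = 0; i < count; i += step) {
        state++;
        state = (state > 5) ? 0 : state;  /* select #1 */
        tmp += (state == 3) ? t2 : 0u;    /* select #2 */
    }
    return tmp;
}

/* Sanity check that both forms compute the same value. */
void check_forms_agree(unsigned t2, int count, int step) {
    assert(foo_branchy(t2, count, step) == foo_select(t2, count, step));
}
```

In the select form both conditional updates sit on the loop-carried path of
every iteration, whereas the branchy form skips the extra work on iterations
where the conditions are false.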

The code generated by clang takes far fewer branches but executes more
instructions. `perf` reports 32.76% front-end cycles idle with the clang code
compared to 24.20% for the gcc-generated code. The clang-generated code also
seems to do worse on branch-miss and icache events (as reported by `perf`),
but it is not clear why. Are the two back-to-back `cmove` instructions the
reason? Any comments on this?


[Attached screenshot: image002.png]


-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.png
Type: image/png
Size: 168040 bytes
Desc: image002.png
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200823/da7c3f2a/attachment-0001.png>

Florian Hahn via llvm-dev

2020-Aug-23 20:03 UTC

Hi,

> On Aug 23, 2020, at 18:53, Riyaz Puthiyapurayil via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> I am analyzing a clang 10.0.0 vs gcc 7.3 performance difference that I can
> reproduce in the following test.

It looks like Clang 11.0/trunk optimizes the example quite differently
(https://godbolt.org/z/P4Eo53). Does the same performance gap exist with Clang
11.0/trunk? If it does, it would probably be best to file a bug at
https://bugs.llvm.org. That way it is easy to track.

Cheers,
Florian

Min-Yih Hsu via llvm-dev

2020-Aug-24 02:46 UTC

Hi,

While reading the machine assembly is one way to diagnose problems, another is
to use the Optimization Remarks framework, following the instructions here:
https://llvm.org/docs/Remarks.html

Basically, it prints a series of messages saying whether an optimization
missed certain expectations, and it also tells you which part of the code each
one applies to.

Best,
Min

Modi Mo via llvm-dev

2020-Aug-31 22:37 UTC

The GCC code executes a shorter loop in 5 out of 6 iterations. When “state” is
0 through 5, the “jle 10” loop is the only one that is executed. Branches in
both cases are completely predictable, so they effectively don’t count as
instructions. Looking at the other instructions, it’s 9 per iteration for
Clang, versus 7 for GCC in the short iteration and 10 in the long one, for an
average of 7.5. So at an “educated guess” level the difference makes sense.
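
The average above is a weighted mean over the six-iteration period. A small
sketch of the arithmetic follows; `avg_insns` is a hypothetical helper, and
the counts 9, 7 and 10 are the ones quoted in this message, not re-measured.

```c
#include <assert.h>

/* Average non-branch instructions per iteration when `short_iters` out of
 * every `period` iterations take the short path and the rest take the
 * long path. */
double avg_insns(int short_insns, int long_insns, int short_iters, int period) {
    assert(period > 0 && short_iters >= 0 && short_iters <= period);
    int long_iters = period - short_iters;
    return ((double)short_insns * short_iters +
            (double)long_insns * long_iters) / period;
}
```

`avg_insns(7, 10, 5, 6)` gives (5*7 + 1*10) / 6 = 7.5 for GCC, against a flat 9
for Clang. That is a ratio of 9 / 7.5 = 1.2, while the reported run times give
0.731 / 0.512, or roughly 1.43, so instruction count alone would account for
only part of the gap under this rough model.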

That being said, with such a small micro-benchmark all the minutiae of
micro-architecture design will hit you in the face. The difference could arise
from function alignment, CPU front-end design, which branch predictor is being
used, etc. If you can, certainly test this out on some different architectures
to see if the difference persists.

As Florian mentioned the codegen is also different in trunk, so try that out and
file a bug if the issue persists.

Modi

